Olist Sales Prediction¶

Olist was founded in 2015 and was a relatively new business when it released its dataset on Kaggle. We have observations from Sept 2016 to Aug 2018. For a relatively new business it is critical to understand how it is growing and how it can prepare to scale and sustain that growth.

Keeping this in mind, we will forecast sales for Olist so that it can allocate resources for stock preparation and marketing campaigns, and be better prepared for the future.

Index¶

1. Importing Libraries
2. Importing Data
    2.1 Data dictionary
    2.2 Processing data for time series
3. Exploratory Data Analysis
    3.1 General exploration
    3.2 Decomposing time series
    3.3 Checking stationarity
4. Preparation for Modeling
    4.1 Train and test split
    4.2 Defining functions for plotting predictions and forecast
    4.3 Defining functions for evaluation
5. Modelling (SARIMA)
    5.1 Plotting ACF and PACF plots
    5.2 Applying SARIMA model
    5.3 Plotting predictions and evaluating SARIMA model
    5.4 Adding exogenous variable holiday for SARIMAX
    5.5 Applying grid search on SARIMAX with exogenous variable
    5.6 Plotting predictions and evaluating SARIMAX model
6. Modelling (FB Prophet)
    6.1 Preparing data for FB Prophet
    6.2 Applying a baseline FB Prophet
    6.3 Plotting and evaluating baseline model
    6.4 Tuning FB Prophet using grid search
    6.5 Plotting and evaluating tuned FB Prophet
7. Modelling (XGBoost Regression)
    7.1 Preparing data for XGBoost Regression
    7.2 Baseline model on daily data and its performance
    7.3 Tuning XGBoost and evaluating its performance
8. Modelling (LSTM)
    8.1 Preparing data for LSTM
    8.2 Baseline model on daily data and its performance
9. Discussing issues encountered with hourly sampled data
10. Conclusion

1. Importing Libraries¶

In [1]:
#importing basic libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_theme(style='darkgrid')

#some built in functions
import itertools
from datetime import datetime, timedelta

import warnings
#to suppress the FutureWarning that fb prophet triggers via its use of .append
# warnings.simplefilter(action='ignore', category=FutureWarning) 
#to suppress warnings in Sarima model
# warnings.simplefilter(action='ignore', category=UserWarning)
warnings.filterwarnings("ignore")
from statsmodels.tools.sm_exceptions import ConvergenceWarning
warnings.simplefilter('ignore', ConvergenceWarning)
from statsmodels.tools.sm_exceptions import InterpolationWarning
warnings.simplefilter('ignore', InterpolationWarning)

#importing high level interactive plotting libraries
import plotly.express as px
from plotly.subplots import make_subplots
import plotly.graph_objs as go

# importing time series stats model
import statsmodels.api as sm
from statsmodels.api import tsa 
from statsmodels.tsa.seasonal import seasonal_decompose
from statsmodels.tsa.stattools import adfuller, kpss
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

#importing Sarimax
from statsmodels.tsa.statespace.sarimax import SARIMAX

#sklearn library
from sklearn.preprocessing import MinMaxScaler

#importing fb prophet
from fbprophet import Prophet
from fbprophet.diagnostics import cross_validation
from fbprophet.plot import plot_plotly

#importing XG Boost 
import xgboost as xgb
from xgboost import plot_importance, plot_tree

#importing LSTM libraries
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout, LSTM, BatchNormalization
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.callbacks import EarlyStopping

2. Importing Data¶

In [2]:
#loading the dataset
master=pd.read_csv('data_cleaned/master_dataset.csv')
In [3]:
#reading the head
master.head()
Out[3]:
order_id customer_id order_purchase_timestamp order_estimated_delivery_date qty product_id seller_id shipping_limit_date price freight_value ... seller_state seller_lat seller_lng customer_unique_id customer_city customer_state customer_lat customer_lng review_score total_amount
0 e481f51cbdc54678b7cc49136f2d6af7 9ef432eb6251297304e76186b10a928d 2017-10-02 10:56:33 2017-10-18 1 87285b34884572647811a353c7ac498a 3504c0cb71d7fa48d967e0e4c94d59d9 2017-10-06 11:07:15 29.99 8.72 ... SP -23.680862 -46.444311 7c396fd4830fd04220f754e42b4e5bff sao paulo SP -23.577482 -46.587077 4 29.99
1 128e10d95713541c87cd1a2e48201934 a20e8105f23924cd00833fd87daa0831 2017-08-15 18:29:31 2017-08-28 1 87285b34884572647811a353c7ac498a 3504c0cb71d7fa48d967e0e4c94d59d9 2017-08-21 20:05:16 29.99 7.78 ... SP -23.680862 -46.444311 3a51803cc0d012c3b5dc8b7528cb05f7 sao paulo SP -23.564636 -46.534401 4 29.99
2 0e7e841ddf8f8f2de2bad69267ecfbcf 26c7ac168e1433912a51b924fbd34d34 2017-08-02 18:24:47 2017-08-15 1 87285b34884572647811a353c7ac498a 3504c0cb71d7fa48d967e0e4c94d59d9 2017-08-08 18:37:31 29.99 7.78 ... SP -23.680862 -46.444311 ef0996a1a279c26e7ecbd737be23d235 sao paulo SP -23.600462 -46.655318 5 29.99
3 bfc39df4f36c3693ff3b63fcbea9e90a 53904ddbea91e1e92b2b3f1d09a7af86 2017-10-23 23:26:46 2017-11-13 1 87285b34884572647811a353c7ac498a 3504c0cb71d7fa48d967e0e4c94d59d9 2017-10-31 02:14:11 29.99 14.10 ... SP -23.680862 -46.444311 e781fdcc107d13d865fc7698711cc572 florianopolis SC -27.546553 -48.497691 3 29.99
4 53cdb2fc8bc7dce0b6741e2150273451 b0830fb4747a6c6d20dea0b8c802d7ef 2018-07-24 20:41:37 2018-08-13 1 595fac2a385ac33a80bd5114aec74eb8 289cdb325fb7e7f891c38608bf9e0962 2018-07-30 03:24:27 118.70 22.76 ... SP -19.807885 -43.980818 af07308b275d755c9edb36a90c618231 barreiras BA -12.186877 -44.540232 4 118.70

5 rows × 29 columns

In [4]:
#understanding the datatypes
master.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 110013 entries, 0 to 110012
Data columns (total 29 columns):
 #   Column                         Non-Null Count   Dtype  
---  ------                         --------------   -----  
 0   order_id                       110013 non-null  object 
 1   customer_id                    110013 non-null  object 
 2   order_purchase_timestamp       110013 non-null  object 
 3   order_estimated_delivery_date  110013 non-null  object 
 4   qty                            110013 non-null  int64  
 5   product_id                     110013 non-null  object 
 6   seller_id                      110013 non-null  object 
 7   shipping_limit_date            110013 non-null  object 
 8   price                          110013 non-null  float64
 9   freight_value                  110013 non-null  float64
 10  product_name_lenght            110013 non-null  float64
 11  product_description_lenght     110013 non-null  float64
 12  product_photos_qty             110013 non-null  float64
 13  product_weight_g               110013 non-null  float64
 14  product_length_cm              110013 non-null  float64
 15  product_height_cm              110013 non-null  float64
 16  product_width_cm               110013 non-null  float64
 17  product_category_name_english  110013 non-null  object 
 18  seller_city                    110013 non-null  object 
 19  seller_state                   110013 non-null  object 
 20  seller_lat                     110013 non-null  float64
 21  seller_lng                     110013 non-null  float64
 22  customer_unique_id             110013 non-null  object 
 23  customer_city                  110013 non-null  object 
 24  customer_state                 110013 non-null  object 
 25  customer_lat                   110013 non-null  float64
 26  customer_lng                   110013 non-null  float64
 27  review_score                   110013 non-null  int64  
 28  total_amount                   110013 non-null  float64
dtypes: float64(14), int64(2), object(13)
memory usage: 24.3+ MB
In [5]:
#checking if there is any null value
master.isnull().sum()
Out[5]:
order_id                         0
customer_id                      0
order_purchase_timestamp         0
order_estimated_delivery_date    0
qty                              0
product_id                       0
seller_id                        0
shipping_limit_date              0
price                            0
freight_value                    0
product_name_lenght              0
product_description_lenght       0
product_photos_qty               0
product_weight_g                 0
product_length_cm                0
product_height_cm                0
product_width_cm                 0
product_category_name_english    0
seller_city                      0
seller_state                     0
seller_lat                       0
seller_lng                       0
customer_unique_id               0
customer_city                    0
customer_state                   0
customer_lat                     0
customer_lng                     0
review_score                     0
total_amount                     0
dtype: int64
In [6]:
print(f" The number of unique orders : {master['order_id'].nunique()}")
print(f" The number of unique customers : {master['customer_unique_id'].nunique()}")
print(f" The total number of customer transactions specified by customer ids :  {master['customer_id'].nunique()}")
print(f" Overall {round((master['customer_unique_id'].nunique()/master['customer_id'].nunique())*100, 2)} % transactions \
are made by new customers and {round((1 -(master['customer_unique_id'].nunique()/master['customer_id'].nunique()))*100,2)} % \
are made by repeat customers.")
print(f" The number of unique sellers {master['seller_id'].nunique()}")
print(f" The number of unique products sold at Olist platform {master['product_id'].nunique()}")
 The number of unique orders : 95832
 The number of unique customers : 92755
 The total number of customer transactions specified by customer ids :  95832
 Overall 96.79 % transactions are made by new customers and 3.21 % are made by repeat customers.
 The number of unique sellers 2965
 The number of unique products sold at Olist platform 32072

2.1 Data dictionary¶

We have a total of 110013 rows (order line items) with 28 features plus the target, describing orders made on the Olist platform. Each row specifies the product category bought with the quantity purchased, unit price, purchase timestamp, delivery details, review score, and customer and seller information.

  • order_id : Specifies the unique order. We have 95832 unique orders. An order_id can repeat in the dataframe, but each repeat corresponds to a different product category and the number of items bought in that category.
  • customer_id: Specifies the customer id for the order. We have 95832 customer ids associated with the order_id.
  • order_purchase_timestamp : The time stamp for the order. It includes date and time.
  • order_estimated_delivery_date : Estimated delivery date at the time of purchase.
  • qty : Number of items bought in a product category
  • product_id : This specifies the actual product within a product category. We have 32072 unique products within 74 overall product categories.
  • seller_id : We have 2965 unique sellers.
  • shipping_limit_date : This date informs the seller of the shipping deadline so they can dispatch the order in time.
  • price : Unit price for each product.
  • freight_value : The freight charges based on product weight and dimension. This value is for one item. If there are three items the total freight will be equal to three times the freight_value.
  • product_name_lenght : Number of characters extracted from the product name.
  • product_description_lenght : Number of characters extracted from the product description.
  • product_photos_qty : Number of product published photos.
  • product_weight_g : Product weight measured in grams.
  • product_length_cm : Product length measured in centimeters.
  • product_height_cm : Product height measured in centimeters.
  • product_width_cm : Product width measured in centimeters.
  • product_category_name_english : English names of product categories.
  • seller_city : It is the city where seller is located.
  • seller_state : It is the state where seller is located.
  • seller_lat : It is the latitude of seller location.
  • seller_lng : It is the longitude of seller location.
  • customer_unique_id : There are 92755 unique customers, which make up 96.79 % of the total customers in the database. Only 3.21 % of the customers are repeat customers. This may be because the data covers the period when Olist had just started its business, so the database is dominated by new customers.
  • customer_city : It is the city where customer is located.
  • customer_state : It is the state where customer is located.
  • customer_lat : It is the latitude of customer location.
  • customer_lng : It is the longitude of customer location.
  • review_score : Reviews submitted by the customers range from 1-5.

Target Variable : total_amount : We calculated this value by multiplying qty and price. This is the actual sales amount, which is what matters for the business. We will be predicting the sales amount to help the business prepare for the future.

Note: We have not considered freight charges in the calculation of 'total_amount' because we found that when Olist started its business it was outsourcing logistics to a third party; we therefore want to give the business insight into only the sales from products sold on the Olist platform.

We also found that Olist acquired PAX, its logistics partner, in 2020; check here for more details.
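The target definition above can be sanity-checked in a couple of lines; a minimal sketch with illustrative rows (the toy values here are not from the dataset):

```python
import pandas as pd

# toy order lines mirroring the qty/price columns (values are illustrative)
orders = pd.DataFrame({
    "qty": [1, 3, 2],
    "price": [29.99, 10.00, 118.70],
})

# total_amount = qty * unit price, per the data dictionary above
orders["total_amount"] = orders["qty"] * orders["price"]
print(orders["total_amount"].tolist())  # [29.99, 30.0, 237.4]
```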

2.2 Processing data for time series¶

We have seen that 'order_purchase_timestamp' is stored as an object (string) rather than a datetime. We will start by converting this column to datetime format, and then extract some date features for analysis.

In [7]:
#converting date columns which are in object format to datetime format
master['order_purchase_timestamp']=pd.to_datetime(master['order_purchase_timestamp'])

We can extract the year, month, week, weekday and day of month from the dates.

In [8]:
#extracting date features from the datetime column (already converted above)
master['purchase_year']=master['order_purchase_timestamp'].dt.year
master['purchase_month']=master['order_purchase_timestamp'].dt.month
master['purchase_MMYYYY']=master['order_purchase_timestamp'].dt.strftime('%m-%Y')
master['purchase_week']=master['order_purchase_timestamp'].dt.isocalendar().week
master['purchase_dayofweek']=master['order_purchase_timestamp'].dt.weekday
master['purchase_dayofmonth']=master['order_purchase_timestamp'].dt.day

We will aggregate the total_amount by dates so that we can get a time series, meaning a dataframe with the total_amount column arranged in order as per dates. We will set the dates as index.

In [9]:
# Creating new DataFrame with daily frequency and daily revenue (total_amount)
df_agg = master.groupby(pd.Grouper(key='order_purchase_timestamp', freq='D'))['total_amount'].sum().reset_index()
df_agg.set_index('order_purchase_timestamp', inplace=True)
df_agg.index.freq = 'D' # To keep pandas inference in check!

#reading top five rows
print(df_agg.head())

# checking the mean, max and count values.
print(df_agg.describe())
                          total_amount
order_purchase_timestamp              
2016-09-15                      269.94
2016-09-16                        0.00
2016-09-17                        0.00
2016-09-18                        0.00
2016-09-19                        0.00
        total_amount
count     714.000000
mean    20881.443333
std     16038.400478
min         0.000000
25%      9275.570000
50%     19772.815000
75%     30819.712500
max    184834.170000
In [10]:
#checking the info
df_agg.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 714 entries, 2016-09-15 to 2018-08-29
Freq: D
Data columns (total 1 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   total_amount  714 non-null    float64
dtypes: float64(1)
memory usage: 11.2 KB
In [11]:
#checking head
df_agg.head()
Out[11]:
total_amount
order_purchase_timestamp
2016-09-15 269.94
2016-09-16 0.00
2016-09-17 0.00
2016-09-18 0.00
2016-09-19 0.00
In [12]:
#index start
df_agg.head(1).index
Out[12]:
DatetimeIndex(['2016-09-15'], dtype='datetime64[ns]', name='order_purchase_timestamp', freq='D')
In [13]:
#index end
df_agg.tail(1).index
Out[13]:
DatetimeIndex(['2018-08-29'], dtype='datetime64[ns]', name='order_purchase_timestamp', freq='D')

We have a total of 714 observations starting from '2016-09-15' until '2018-08-29'.

3. Exploratory Data Analysis¶

Now that we have our original master data and time series data, we will try to explore some high level features of our master data and will go a level deeper for our time series.

3.1 General exploration¶

We can plot a heat map to see which numerical features are highly correlated with total_amount. This is just a high-level overview to see which features can impact sales, and also the correlation among the features.

In [14]:
#canvas size
plt.figure(figsize=(18,12))
#correlation between all columns
corr_df= master.select_dtypes(include=np.number).corr()
# creating mask to hide the upper triangle
mask = np.triu(np.ones_like(corr_df, dtype=bool))
# plotting a triangle correlation heatmap
sns.heatmap(corr_df, cmap="RdPu", annot=True, mask=mask) 
Out[14]:
<AxesSubplot: >

Observation:¶

  • We can see that total_amount is highly correlated with price. This is expected because total_amount was calculated using price.
  • purchase_week and purchase_month are highly correlated.
  • product_weight_g and freight_value are positively correlated, as freight is calculated based on product weight.
  • We don't see any other feature standing out with a high correlation to total_amount.
In [15]:
fig=px.histogram(df_agg, x='total_amount')
fig.update_layout(
    yaxis_title="Frequency",
    xaxis_title="Revenue in Brazilian Real",
    legend_title="", 
    title="Daily Revenue distribution from Sept 2016 to Aug 2018")
fig.show()

Observations:¶

  • There is a peak at zero because we don't have observations for most of the days in 2016.
  • If we ignore that, the distribution is roughly normal with some outliers on the right side. These observations are from peak sales periods.
In [16]:
#creating a pivot table indexed by year, month and the MM-YYYY string, aggregating the sum of total_amount and the number of unique orders
sales_df=master.pivot_table(values = ['order_id', 'total_amount']
                              , index=['purchase_year','purchase_month','purchase_MMYYYY']
                              , aggfunc={'order_id':'nunique', 'total_amount':'sum'})
from plotly.subplots import make_subplots

#to plot number of orders by MM-YYYY
trace1 = go.Bar(
    x=sales_df.index.get_level_values(2),
    y=sales_df['order_id'],
    name='Orders',
    marker=dict(
        color='rgb(34,163,192)'
               )
)
#to plot sum of total_amount by MM-YYYY
trace2 = go.Scatter(
    x=sales_df.index.get_level_values(2),
    y=sales_df['total_amount'],
    name='Revenue',
    yaxis='y2' #using a right side y-axis

)

fig = make_subplots(specs=[[{"secondary_y": True}]])
fig.add_trace(trace1)
fig.add_trace(trace2,secondary_y=True)
fig['layout'].update(height = 600, width = 1000, title = 'Revenue and Orders by Month-Year',xaxis=dict(
      tickangle=-90, title='Purchase Month'), yaxis=dict(title= 'Number of Orders'), yaxis2=dict(title= 'Revenue amount'))
fig.show()

We can see that both the number of orders and the revenue are growing. Some peaks reappear after a few months. The highest peak was recorded in Nov 2017.

In [17]:
print(f"The overall revenue earned as of Aug 2018 is {df_agg['total_amount'].sum()} Brazilian Real.")
The overall revenue earned as of Aug 2018 is 14909350.540000001 Brazilian Real.
In [18]:
df_agg.groupby(df_agg.index.year).sum()
Out[18]:
total_amount
order_purchase_timestamp
2016 45756.84
2017 6722987.98
2018 8140605.72
In [19]:
#revenue growth from 2016 to 2017, as a share of 2017 revenue
(6722987.98-45756.84)/6722987.98
Out[19]:
0.9931939726597577
In [20]:
#revenue growth from 2017 to 2018 (through Aug), as a share of 2018 revenue
(8140605.72-6722987.98)/8140605.72
Out[20]:
0.17414155515690538
In [21]:
#plotting daily data to get high level picture

fig = px.line(df_agg, x=df_agg.index, y='total_amount')

# axis labels and title
fig.update_layout(
    yaxis_title="Total Revenue earned (Brazilian Real)", 
    legend_title="", 
    title="Daily Revenue from Sept 2016 to Aug 2018"
)

# activate slider
fig.update_xaxes(rangeslider_visible=True)

#annotate peak
fig.add_annotation(x='2017-11-24', y= 184834.17, text=f'Black Friday Sale', yanchor='bottom', 
                   showarrow=True, arrowhead=1, arrowsize=1, arrowwidth=2, arrowcolor="#636363",
                   ax=-20, ay=-30, font=dict(size=15, color="green", family="Courier New, monospace"),
                   align="left", bordercolor="green", borderwidth=2, bgcolor="#CFECEC", opacity=0.8)

fig.show()

Observations:¶

  • We can see a somewhat positive trend.
  • Sales are almost zero from October 10 2016 until Jan 2017. This data was possibly captured during an initial test phase.
  • It looks like the business started rolling from Jan 1 2017.
  • The highest spike is on Nov 24 2017, which was Black Friday.
  • There is another spike on Sept 29 2017, but I could not find the reason for it.
  • Seasonality is not clear at this point.

Removing the observations before Jan 1 2017¶

We remove the data before Jan 01 2017 because there are many consecutive days with zero sales. Including them in our model may distort the forecast.

This may be because the period of Sept 2016 to Dec 2016 was an experimental phase. We have continuous sales after Jan 2017.

In [22]:
#removing the rows before Jan 01 2017
daily_data=df_agg.loc[df_agg.index>='2017-01-01', :]
In [23]:
daily_data.tail()
Out[23]:
total_amount
order_purchase_timestamp
2018-08-25 10891.40
2018-08-26 8526.19
2018-08-27 5542.90
2018-08-28 4088.37
2018-08-29 2670.54
In [24]:
print(f"The total number of data points to work on: {daily_data.shape[0]}")
The total number of data points to work on: 606

It will be challenging to forecast sales with such limited data.

In [25]:
#saving the start and end dates separately
start_date=daily_data.index[0]
end_date=daily_data.index[-1]
In [26]:
from plotly.subplots import make_subplots

df1=master.groupby('purchase_week')['total_amount'].sum()
df2=master.groupby('purchase_dayofmonth')['total_amount'].sum()

fig=make_subplots(rows=1, cols=2, subplot_titles=("Revenue by weeks", "Revenue by days of month"))

fig.add_trace(go.Scatter(name='Weekly', x=df1.index, y=df1.values), row=1, col=1 )

fig.add_trace(go.Scatter(name='Monthly', x=df2.index, y=df2.values), row=1, col=2 )

fig.update_layout(height=500, width=1000,
                  title_text="Understanding patterns of revenue earned at weekly and monthly level")

fig.show()

Let us see which categories have the highest sales amount.

In [27]:
#the categories with the highest sales amount
df=master.groupby('product_category_name_english')['total_amount'].sum().sort_values(ascending=False)
fig = px.bar(df, x= df.index,
             y=df.values,
             labels={'y':'Sales amount'},
             title='Product Category by sales amount',
             width=1000, height=800      )
fig.update_xaxes(tickangle= -90) 
fig.show()

Observations:¶

  • health_beauty, watches_gifts, bed_bath_table, computers_accessories and sports_leisure are the top categories by sales amount.
  • pc_games, cds_dvds_musicals and fashion_children_clothes are the lowest-earning product categories.

3.2 Decomposing time series¶

We will be decomposing the time series using additive decomposition so that we can observe the underlying trend, seasonality and residuals.

Additive Decomposition : $y_t = Trend_t + Seasonality_t + Residual_t$

In [28]:
# decompose the time series
decomposition = tsa.seasonal_decompose(daily_data, model='additive')
In [29]:
#saving a copy to a new dataframe
daily_df=daily_data.copy()
In [30]:
# add the decomposition data
daily_df['Trend'] = decomposition.trend
daily_df['Seasonal'] = decomposition.seasonal
daily_df['Residual'] = decomposition.resid
In [31]:
#plotting the actual and decomposed components of the time series
cols = ["total_amount","Trend", "Seasonal", "Residual"]

fig = make_subplots(rows=4, cols=1, subplot_titles=cols)

for i, col in enumerate(cols):
    fig.add_trace(
        go.Scatter(x=daily_df.index, y=daily_df[col]),
        row=i+1,
        col=1
    )

fig.update_layout(height=800, width=1000, showlegend=False)
fig.show()

Observation:¶

  • We can see a slight upward trend. The trend peaks around Nov 26, 2017 because of the Black Friday sale on Nov 24, 2017. It falls afterwards but then rises again. Although Black Friday is an outlier, we should keep it in our calculations as it is an important factor.
  • There is weekly seasonality. Sales peak once a week and then fall.
  • There is no clear pattern in the residual. It has captured the peaks of Nov 24, 2017 and Sept 29, 2017.
In [32]:
plt.subplots(figsize=(10,5))
# sns.set_theme(style= 'darkgrid' )
plt.title('Pattern of revenue earned at week level')
df_week_check=daily_data.copy()
df_week_check['Weekday']= df_week_check.index.weekday
week_day=['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
df_week_check['Weekday']=df_week_check['Weekday'].apply(lambda x: week_day[x])          
sns.boxplot(data=df_week_check, x='Weekday', y='total_amount')
Out[32]:
<AxesSubplot: title={'center': 'Pattern of revenue earned at week level'}, xlabel='Weekday', ylabel='total_amount'>

Observations:¶

  • We can see that the amount earned is high on Mondays and Tuesdays. It decreases towards the end of the week.
  • There are some outliers on Friday and Saturday because of the holidays.

We can now proceed with checking the stationarity of the time series so that we can start modelling.

3.3 Checking stationarity¶

Forecasting is built on the assumption that historical data is representative of the future. It is important for a time series to be stationary; if it is not, the properties of the data change over time and it will be hard to forecast.

If the mean and variance of the time series are constant, the time series is stationary. We will plot the rolling mean and rolling standard deviation of the time series to visually check for stationarity.

  • A rolling average can help you find trends that would otherwise be hard to detect.
  • Volatility is based on standard deviation, a measure of how much the data (stock prices) varies from the average - or the measure of spread.

https://www.linkedin.com/pulse/qb4-bollinger-bands-rolling-mean-standard-deviation-alan-mcdowell/

In [33]:
#plotting rolling mean and standard deviation.

# Get Things Rolling
roll_mean = daily_df['total_amount'].rolling(window=7).mean()
roll_std = daily_df['total_amount'].rolling(window=7).std()
    
# Figure
fig, ax = plt.subplots(figsize=(15,7), facecolor='w')
ax.plot(daily_df['total_amount'], label='Original')
ax.plot(roll_mean, label='Rolling Mean')
ax.plot(roll_std,  label='Rolling STD')
    
# Legend & Grid
ax.legend(loc='upper right')
plt.grid(linestyle=":", color='grey')
plt.show()

Observation:¶

  • The mean is not constant. As we progress along the series, the rolling mean shows a somewhat upward trend.
  • The rolling mean rises and falls within a two-month period.
  • The rolling standard deviation also doesn't look constant.
  • Therefore, we can say that the series is not stationary.

Statistical tests to check stationarity¶

ADF - Augmented Dickey Fuller Test¶

ADF test is used to determine the presence of unit root in the series, the presence of a unit root means the time series is non-stationary. Besides, the number of unit roots contained in the series corresponds to the number of differencing operations required to make the series stationary. The null and alternate hypothesis of this test are:

$H_0$ : The series has a unit root.

$H_a$: The series has no unit root.

If we fail to reject the null hypothesis (the p-value obtained is greater than the significance level, say 0.05), the test provides evidence that the series is non-stationary.

KPSS - Kwiatkowski-Phillips-Schmidt-Shin¶

KPSS is another test for checking the stationarity of a time series. The null and alternate hypothesis for the KPSS test are opposite that of the ADF test.

$H_0$ : The process is trend stationary.

$H_a$: The series has a unit root (series is not stationary).

In [34]:
def perform_adf_test(df) -> None:
    """
    Augmented Dickey Fuller Test
    - The null hypothesis for this test is that there is a unit root.
    - The alternate hypothesis is that there is no unit root in the series.
    ---
    Args:
        df (pd.DataFrame): Dataframe contains the timeseries data
        
    Returns: None
    """
    
    adf_stat, p_value, n_lags, n_observ, crit_vals, icbest = adfuller(df)
    
    print('\nAugmented Dickey Fuller Test')
    print('---'*15)
    print('ADF Statistic: %f' % adf_stat)
    print('p-value: %f' % p_value)
    print(f'Number of lags used: {n_lags}')
    print(f'Number of observations used: {n_observ}')
    print(f'T values corresponding to adfuller test:')
    for key, value in crit_vals.items():
        print(key, value)


def perform_kpss_test(df) -> None:
    """
    Kwiatkowski-Phillips-Schmidt-Shin test for stationary.
    - The null hypothesis for the test is that the data is stationary.
    - The alternate hypothesis for the test is that the data is not stationary.
    ---
    Args:
        df (pd.DataFrame): Dataframe that contains the timeseries data
        
    Returns: None
    """
    
    kpss_stat, p_value, n_lags, crit_vals = kpss(df, nlags='auto', store=False)
    print('\nKwiatkowski-Phillips-Schmidt-Shin test')
    print('---'*15)
    print('KPSS Statistic: %f' % kpss_stat)
    print('p-value: %f' % p_value)
    print(f'Number of lags used: {n_lags}')
    print(f'Critical values of KPSS test:')
    for key, value in crit_vals.items():
        print(key, value)

ADF and KPSS test on original Sales amount¶

In [35]:
print("ADF and KPSS test on original total_amount")
print("******************************************")
perform_adf_test(daily_df['total_amount'])
perform_kpss_test(daily_df['total_amount'])
ADF and KPSS test on original total_amount
******************************************

Augmented Dickey Fuller Test
---------------------------------------------
ADF Statistic: -3.575418
p-value: 0.006249
Number of lags used: 8
Number of observations used: 597
T values corresponding to adfuller test:
1% -3.4413510722333087
5% -2.8663934413235266
10% -2.5693547658168003

Kwiatkowski-Phillips-Schmidt-Shin test
---------------------------------------------
KPSS Statistic: 2.703197
p-value: 0.010000
Number of lags used: 15
Critical values of KPSS test:
10% 0.347
5% 0.463
2.5% 0.574
1% 0.739

Observation and findings:¶

ADF & KPSS Results on Original sales data

  • Since the ADF p-value: 0.006249 < 0.05, we can reject $H_{0}$ in favour of $H_{a}$.
  • Since the KPSS p-value: 0.01 < 0.05, we can reject $H_{0}$ in favour of $H_{a}$.

Based on these results we can conclude that:

  • According to the ADF test our series has no unit root, inferring that the series is stationary.
  • According to the KPSS test our series is not trend-stationary.

Taking reference from statsmodels, which says: if KPSS indicates non-stationarity and ADF indicates stationarity, the series is difference stationary. Differencing should be used to make the series stationary, and the differenced series is then rechecked for stationarity.

We will proceed ahead with differencing the time series and recheck for stationarity.

Differencing the time series¶

We will difference the series against the previous day's observation.

In [36]:
#differencing with previous day
daily_df["day_difference"] = daily_df["total_amount"].diff(1)
In [37]:
print("  ADF and KPSS test on differnced data  ")
print("******************************************")
perform_adf_test(daily_df["day_difference"].dropna())
perform_kpss_test(daily_df["day_difference"].dropna())
  ADF and KPSS test on differnced data  
******************************************

Augmented Dickey Fuller Test
---------------------------------------------
ADF Statistic: -9.426039
p-value: 0.000000
Number of lags used: 12
Number of observations used: 592
T values corresponding to adfuller test:
1% -3.441444394224128
5% -2.8664345376276454
10% -2.569376663737217

Kwiatkowski-Phillips-Schmidt-Shin test
---------------------------------------------
KPSS Statistic: 0.108792
p-value: 0.100000
Number of lags used: 29
Critical values of KPSS test:
10% 0.347
5% 0.463
2.5% 0.574
1% 0.739

Observations and findings:¶

ADF & KPSS Results on differenced data

  • Since the ADF p-value 0.000 < 0.05, we can reject the $H_{0}$ hypothesis in favour of $H_{a}$.
  • Since the KPSS p-value 0.1 > 0.05, we cannot reject the $H_{0}$ hypothesis in favour of $H_{a}$.

Based on these results we can conclude that:

  • According to the ADF test our series has no unit root, inferring that the series is stationary.
  • According to the KPSS test our series is stationary.

Both tests conclude that - The series is stationary.

We can proceed ahead with modelling.

4. Preparation for Modeling¶

We want to create some functions that we will call again and again. We will start by creating a train and test split.

4.1 Train and test split¶

We will split the series into train and test sets. We will not carve a validation set out of the train set as we have a limited amount of data.

In [38]:
def train_test_split(df, train_end, test_end):
    """
    Splits the input dataframe into train and test sets at a given date (roughly 80% / 20% here).
    ---
    Args:
        df : dataframe to split, with datetime index.
        train_end: end date of the train set (inclusive); a datetime or string of format YYYY-MM-DD.
        test_end: end date of the test set (the test set starts right after train_end).

    Returns:
        train_df (pd.DataFrame): Train Dataframe
        test_df (pd.DataFrame):  Test Dataframe
    """
    train_set = df[df.index <= train_end]
    test_set = df[df.index > train_end]
    return train_set, test_set


train_end = '2018-4-30'
test_end = '2018-8-29'

train_df, test_df = train_test_split(daily_data, train_end, test_end)
In [39]:
print(f'The Train data has time range :Shape {train_df.shape} | {train_df.index[0]} to {train_df.index[-1]}')
print(f'The Test data has time range :Shape {test_df.shape} | {test_df.index[0]} to {test_df.index[-1]}' )
The Train data has time range :Shape (485, 1) | 2017-01-01 00:00:00 to 2018-04-30 00:00:00
The Test data has time range :Shape (121, 1) | 2018-05-01 00:00:00 to 2018-08-29 00:00:00

4.2 Defining functions for plotting predictions and forecast¶

In [40]:
def plot_forecast(train_set, test_set, fc_series:pd.Series) -> None:
    """
    This function plots the train, test and forecast values.
    ---
    Args:
        train_df:  training dataframe with datetime index and only one column y
        test_df :  test dataframe with datetime index and only one column y
        fc_series: forecast series 
        
    Returns: None
    """
    
    fig = go.Figure()
    fig.add_trace(go.Scatter(x=train_set.index, y=train_set.values, mode='lines', name="Train"))
    fig.add_trace(go.Scatter(x=test_set.index, y=test_set.values, mode='lines', name="Test"))
    fig.add_trace(go.Scatter(x=fc_series.index, y=fc_series.values, mode='lines', name="Forecast"))

    fig.update_xaxes(rangeslider_visible=True)
    fig.update_layout(
        yaxis_title="Revenue amount", 
        xaxis_title="Date",
        title="Daily Sales amount and forecast"
    )
    fig.show()
In [41]:
def plot_test_predictions(test_set, predictions) -> None:
    """
    This function plots the test set vs predicted values.
    ---
    Args:
        test_df : test dataframe with datetime index and only one column y
        predictions : prediction values (array or list)
    
    Returns: None
    """
    test_set=pd.Series(test_set, index= test_set.index)
    predictions=pd.Series(predictions, index= test_set.index)
    
    fig = go.Figure()
    fig.add_trace(go.Scatter(x=test_set.index, y=test_set.values, mode='lines', name="Test"))
    fig.add_trace(go.Scatter(x=predictions.index, y=predictions.values, mode='lines', name="Predictions"))

    fig.update_xaxes(rangeslider_visible=True)
    fig.update_layout(
        yaxis_title="Revenue amount for test and predicted values", 
        xaxis_title="Date",
        title="Daily revenue amount"
    )
    fig.show()

4.3 Defining functions for evaluation¶

Throughout this notebook we will be using these two functions to evaluate the performance of our models. We will define functions to calculate MAPE and RMSE. If Y is the actual value and Predictions the predicted value for n observations, then:

MAPE (Mean Absolute Percentage Error): It is a simple average of absolute percentage errors. It is calculated by

$$ \frac{1}{n} \sum_{i=1}^{n} {| \frac{Y_{actual_i} - Predictions_{i}}{Y_{actual_i}} |} \times{100} $$

RMSE (Root Mean Squared Error): It is the square root of the average of the squared differences between the original and predicted values in the data set.

$$ \sqrt{\frac{1}{n} \sum_{i=1}^{n} {{(Y_{actual_i} - Predictions_{i})}^2 }} $$
In [42]:
def mape_metrics(test_set, predicted) -> float:
    """
    This function calculates the MAPE.
    ---
    Args:
        test_set (pd.Series):  test set filtered series with y
        predicted (pd.Series):  predicted series
        
    Returns: float MAPE percentage
    """
    # Calculate the MAPE value and return
    mape_result=round(np.mean(np.abs((test_set - predicted) / test_set)) * 100, 2)
    return mape_result

def rmse_metrics(test_set, predicted) -> float:
    """
    This function calculates the RMSE.
    ---
    Args:
        test_set (pd.Series):  test set filtered series with y
        predicted (pd.Series):  predicted series
        
    Returns: float RMSE
    """  
    # Calculate the RMSE value and return
    return round(np.sqrt(np.mean((test_set - predicted)**2)),2)
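A quick sanity check of the two metrics on toy numbers (the definitions are repeated here so the snippet is self-contained):

```python
import numpy as np
import pandas as pd

def mape_metrics(test_set, predicted) -> float:
    return round(np.mean(np.abs((test_set - predicted) / test_set)) * 100, 2)

def rmse_metrics(test_set, predicted) -> float:
    return round(np.sqrt(np.mean((test_set - predicted) ** 2)), 2)

actual = pd.Series([100.0, 200.0, 300.0])
predicted = pd.Series([110.0, 180.0, 330.0])

# Each prediction is off by exactly 10%, so MAPE is 10.0
print(mape_metrics(actual, predicted))  # 10.0
# RMSE = sqrt((10^2 + 20^2 + 30^2) / 3) = sqrt(466.67) ≈ 21.6
print(rmse_metrics(actual, predicted))  # 21.6
```

One caveat: MAPE divides by the actual value, so it would be undefined on zero-sales days.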

5. Modelling (SARIMA)¶

We will start with a SARIMA model to account for the seasonality in our data. SARIMA is Seasonal Autoregressive Integrated Moving Average, which explicitly supports univariate time series data with a seasonal component. Before jumping into modelling, we need a basic understanding of which orders to choose for the autoregressive and moving average terms. We will plot the ACF and PACF plots to find out.

ACF: the autocorrelation function, describes the correlation between the original and lagged series. PACF: the partial autocorrelation function is the same as the ACF but removes all intermediary effects of shorter lags, leaving only the direct effect visible.

5.1 Plotting ACF and PACF plot¶

In [43]:
def plot_acf_pacf(df, acf_lags: int, pacf_lags: int) -> None:
    """
    This function plots the Autocorrelation and Partial Autocorrelation lags.
    ---
    Args:
        df (pd.DataFrame): Dataframe contains the order count and dates.
        acf_lags (int): Number of ACF lags
        pacf_lags (int): Number of PACF lags
    Returns: None
    """
    
    # Figure
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16,9), facecolor='w')
    
    # ACF & PACF
    plot_acf(df, ax=ax1, lags=acf_lags)
    plot_pacf(df, ax=ax2, lags=pacf_lags, method='ywm')

    # Labels
    ax1.set_title(f"Autocorrelation {df.name}", fontsize=15, pad=10)
    ax1.set_ylabel("Sales amount", fontsize=12)
    ax1.set_xlabel("Lags (Days)", fontsize=12)

    ax2.set_title(f"Partial Autocorrelation {df.name}", fontsize=15, pad=10)
    ax2.set_ylabel("Sales amount", fontsize=12)
    ax2.set_xlabel("Lags (Days)", fontsize=12)
    
    # Legend & Grid
    ax1.grid(linestyle=":", color='grey')
    ax2.grid(linestyle=":", color='grey')

    plt.show()
In [44]:
#plotting the ACF and PACF plot for original series
plot_acf_pacf(daily_df['total_amount'], acf_lags=56, pacf_lags= 56)

Observation:¶

ACF plot:

  • It shows that there are a lot of significant lags. The ACF plot decays very slowly and none of the lags become zero. It means that our data is not stationary, as we have already shown with the statistical tests and by observing the rolling mean and standard deviation.
  • Since the ACF plot is decaying it is an AR model, and its order can be determined from the PACF plot.
  • We will need to difference the series so that we can identify some significant lags.
  • We can see that the lag peaks every 7 days. This is the seasonality of the series.

PACF plot:

  • The PACF plot has a few significant lags, but the plot is not decaying much and has very little oscillation. So it is hard to identify whether moving averages can be utilized for this model.

We will plot the ACF and PACF after double differencing, i.e., applying a seasonal difference of 7 days to the day_difference series.

In [45]:
#double differencing the column total_amount
daily_df['double_difference'] = daily_df['day_difference'].diff(7)
In [46]:
#plotting the ACF and PACF plot for double differenced series
plot_acf_pacf(daily_df['double_difference'].dropna(), acf_lags=56, pacf_lags= 56)

Observation:¶

It is a little difficult to tell, simply from a time plot, what values of p and q are appropriate for the data. But we will try to find the appropriate orders.

ACF plot:

  • A lot of lags have become zero now. A few lags are still significant: 1, 4, 6, 7 and 8.
  • We could use a non-seasonal MA of order 1, 4 or 6, but the most significant lag is 1.
  • The lag at 7 is also important and can help determine the seasonal MA component.

PACF plot:

  • It shows significant peaks at every seasonal lag of 7 days and the peaks are diminishing. The non-seasonal lags are also diminishing. So we can use seasonal and non-seasonal MA components in the model; their orders can be determined from the significant lags in the ACF plot.
  • Apart from the seasonal lags, lags 1, 2 and 3 are also significant. We can use one of these lags for our non-seasonal AR component.

5.2 Applying SARIMA model ¶


The SARIMA model is specified as:

$$SARIMA(p, d, q) \times (P, D, Q)_s$$

Where:

  • Trend Elements are:
    • p: Autoregressive order
    • d: Difference order
    • q: Moving average order
  • Seasonal Elements are:
    • P: Seasonal autoregressive order.
    • D: Seasonal difference order. D=1 would calculate a first order seasonal difference
    • Q: Seasonal moving average order. Q=1 would use first order errors in the model
    • s: Single seasonal period

Theoretical estimates:¶

  • s: In our PACF plot there is a peak that reappears every 7 days. Thus, we can set the seasonal period to s = 7. This is also backed by the seasonal component of our additive decomposition.
  • p: We observed some tapering in the ACF plot and found the significant lags 1, 2 and 3 in the PACF plot. We can start with p = 1 and see how it works.
  • d: We observed that our series has some trend, so we can remove it by differencing; d = 1.
  • q: Based on our ACF correlations we can set q = 1, since it is the most significant lag.
  • P: P = 0 as we are using the ACF plot to find the significant seasonal lag.
  • D: Since we are dealing with seasonality and need to difference the series seasonally, D = 1.
  • Q: The seasonal moving average order will be set to Q = 1 as we found only one significant seasonal lag in the ACF plot. Here we go:
$$ SARIMA(1, 1, 1) \times (0, 1, 1)_{7} $$

Baseline Sarima Model¶

In [47]:
# Set Hyper-parameters
p, d, q = 1, 1, 1
P, D, Q = 0, 1, 1
s = 7

# Fit SARIMA
sarima_model = SARIMAX(train_df['total_amount'], order=(p, d, q), seasonal_order=(P, D, Q, s))
sarima_model_fit = sarima_model.fit(disp=0)
print(sarima_model_fit.summary())
                                     SARIMAX Results                                     
=========================================================================================
Dep. Variable:                      total_amount   No. Observations:                  485
Model:             SARIMAX(1, 1, 1)x(0, 1, 1, 7)   Log Likelihood               -5042.977
Date:                           Wed, 15 Feb 2023   AIC                          10093.954
Time:                                   08:55:12   BIC                          10110.624
Sample:                               01-01-2017   HQIC                         10100.509
                                    - 04-30-2018                                         
Covariance Type:                             opg                                         
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
ar.L1          0.2609      0.044      5.920      0.000       0.175       0.347
ma.L1         -0.8489      0.032    -26.659      0.000      -0.911      -0.787
ma.S.L7       -0.9525      0.020    -48.088      0.000      -0.991      -0.914
sigma2      8.693e+07   1.58e-10    5.5e+17      0.000    8.69e+07    8.69e+07
===================================================================================
Ljung-Box (L1) (Q):                   0.01   Jarque-Bera (JB):            279160.07
Prob(Q):                              0.93   Prob(JB):                         0.00
Heteroskedasticity (H):              12.95   Skew:                             8.05
Prob(H) (two-sided):                  0.00   Kurtosis:                       120.42
===================================================================================

Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).
[2] Covariance matrix is singular or near-singular, with condition number 1.76e+32. Standard errors may be unstable.
In [48]:
# Plot diagnostics
sarima_model_fit.plot_diagnostics(figsize=(16,7))
plt.show()

Observations:¶

  • The standardized residual plot: the residuals appear as white noise, much like the residual of the decomposed time series.
  • The normal Q-Q plot: shows that the ordered distribution of residuals follows the linear trend of samples taken from a standard normal distribution N(0, 1). There are some outliers, as we have seen earlier.
  • Histogram and estimated density plot: the KDE follows the N(0, 1) line, however with noticeable differences. As mentioned before, our distribution has heavier tails.
  • The correlogram plot: shows that the time series residuals have low correlation with lagged versions of themselves, meaning there are no patterns left to extract in the residuals.

Let's test the model on our test set:

5.3 Plotting predictions and evaluating SARIMA model¶

Prediction using SARIMA¶

In [49]:
# defining prediction period
pred_start_date = test_df.index[0]
pred_end_date = test_df.index[-1]

sarima_predictions = sarima_model_fit.predict(start=pred_start_date, end=pred_end_date)
sarima_residuals = test_df['total_amount'] - sarima_predictions

Evaluation of SARIMA¶

In [50]:
# Get evaluation data
sarima_root_mean_squared_error = rmse_metrics(test_df['total_amount'], sarima_predictions)
sarima_mape_error = mape_metrics(test_df['total_amount'], sarima_predictions)

print(f'Root Mean Squared Error | RMSE: {sarima_root_mean_squared_error}')
print(f'Mean Absolute Percentage Error | MAPE: {sarima_mape_error}')
Root Mean Squared Error | RMSE: 13810.6
Mean Absolute Percentage Error | MAPE: 68.99

We get a MAPE of 68.99% and an RMSE of 13810.6.

Plotting test and prediction¶

In [51]:
plot_test_predictions(test_df['total_amount'], sarima_predictions)

Observations:¶

Our baseline model is able to capture the seasonality and trend components but is not able to pick up the variations between days.

Sarima Forecast¶

We will try to forecast the sales for the next 180 days. The first 121 days are known from our test data, and we will see what our model forecasts for the roughly 60 days after that.

In [52]:
# Forecast Window
days = 180

sarima_forecast = sarima_model_fit.forecast(days)
sarima_forecast_series = pd.Series(sarima_forecast, index=sarima_forecast.index)

# Since negative sales amounts are not possible we can trim them.
sarima_forecast_series[sarima_forecast_series < 0] = 0

Plotting Forecast using baseline SARIMA¶

In [53]:
plot_forecast(train_df['total_amount'], test_df['total_amount'], sarima_forecast_series)

Observations:¶

  • The model predicts the overall daily patterns pretty well.
  • It is not performing well at picking up the variation between days.
  • It is positively trending and is not capturing the peaks and troughs.
  • We will need to tune it further and should also add a holiday feature so that it can pick up some information from that.
  • While this model doesn't have great long-term predictive power, it can serve as a solid baseline for our next models.

5.4 Adding Exogenous variable holiday for SARIMAX¶

Let us include the exogenous variable holiday to improve our model.

In [54]:
#reading the data from holiday file 
holiday=pd.read_csv('data_cleaned/holiday.csv', index_col=0)
In [55]:
#checking the info
holiday.info()
<class 'pandas.core.frame.DataFrame'>
Index: 36 entries, 2017-01-01 to 2018-12-25
Data columns (total 1 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   is_holiday  36 non-null     float64
dtypes: float64(1)
memory usage: 576.0+ bytes
In [56]:
#converting index to date time format
holiday.index=pd.to_datetime(holiday.index)
In [57]:
#reading head
holiday.head()
Out[57]:
is_holiday
2017-01-01 1.0
2017-02-27 1.0
2017-02-28 1.0
2017-03-01 1.0
2017-04-14 1.0

Our holiday dataframe only has the dates on which there is a holiday, not all dates. We can fill in the missing dates and change the is_holiday column to int format.

In [58]:
#fill the missing dates 
idx = pd.date_range('2017-01-01', '2018-12-31')
holiday = holiday.reindex(idx, fill_value=0)

#converting is_holiday column to int type
holiday['is_holiday']=holiday['is_holiday'].astype(int)
In [59]:
#checking the info
holiday.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 730 entries, 2017-01-01 to 2018-12-31
Freq: D
Data columns (total 1 columns):
 #   Column      Non-Null Count  Dtype
---  ------      --------------  -----
 0   is_holiday  730 non-null    int32
dtypes: int32(1)
memory usage: 8.6 KB

Since our data shows the impact of Black Friday, we can add a few more days such as Valentine's Day and Black Friday as holidays.

In [60]:
#adding some more days as holidays: Valentine's Day and Black Friday
#Feb 14, Nov 24 2017 and Nov 23 2018
holiday.loc[((holiday.index.day== 14) & (holiday.index.month==2)),:]=1
holiday.loc[((holiday.index== '2017-11-24') | (holiday.index== '2018-11-23') ),:]=1
In [61]:
#plot to show holidays
plt.figure(figsize=(100,10))
sns.heatmap(holiday.T, cmap="flare", xticklabels=[])
plt.show()
In [62]:
#filtering the holidays only up to the end of our test data period.
holiday_df=holiday.loc[holiday.index<='2018-08-29']
In [63]:
#adding the holiday data to our new dataframe.
dfex=pd.concat([daily_data, holiday_df], axis=1)
In [64]:
#plotting daily data to get high level picture

fig = px.line(dfex, x=dfex.index, y='total_amount')

# axis labels and title
fig.update_layout(
    yaxis_title="Total Revenue earned (Brazilian Real)", 
    legend_title="", 
    title="Daily Revenue from Jan 2017 to Aug 2018"
)
v_line=dfex.loc[dfex['is_holiday']==1].index
for idx in v_line:
    fig.add_vline(idx,line_width=0.5, line_dash="dash", line_color="red")

# activate slider
fig.update_xaxes(rangeslider_visible=True)

fig.show()

SARIMAX using an Exogenous variable¶

We can split the df into train and test so that we can do some modelling on it.

In [65]:
#splitting into test and train
train_dfex, test_dfex = train_test_split(dfex, train_end, test_end)

5.5 Applying grid search on SARIMAX with exogenous variable¶

In [66]:
#instantiating order variables
p = range(1, 5)
d = range(1,2)
q = range(1, 3)
s = 7

P=range(0,1)

D=range(1,2)
Q=range(1,3)
pdq = list(itertools.product(p, d, q))
seasonal_pdq=list(itertools.product(P, D, Q))
seasonal_PDQS= [(x[0], x[1], x[2], s) for x in seasonal_pdq]


def grid_search_sarimax(train_set, test_set) -> pd.DataFrame:
    """
    This function performs a grid search over SARIMAX model hyper-parameters.
    ---
    Args:
        train_set (pd.DataFrame): Training set for grid search.
        test_set (pd.DataFrame):  Test set used to score each fitted model.
        
    Returns: pd.DataFrame summary with AIC, BIC, MSE, SSE, RMSE and MAPE per model.
    """
    # Supress UserWarnings
    warnings.simplefilter('ignore', category=UserWarning)
    
    #instantiating variables
    pred_start_date = test_set.index[0]
    pred_end_date = test_set.index[-1]
    summary=pd.DataFrame(columns=['Model', 'AIC', 'BIC', 'MSE', 'SSE', 'RMSE', 'MAPE'])
    data={}
    # Grid Search
    for order in pdq:
        for seasonal_order in seasonal_PDQS:
            model = SARIMAX(train_set['total_amount'],
                            order=order,
                            seasonal_order=seasonal_order,
                            exog=(train_set[['is_holiday']]))
            results = model.fit(disp=0)
            predictions = results.predict(start=pred_start_date, end=pred_end_date, 
                                          exog=(test_set[['is_holiday']]))
            
            data= { "Model":f'{order}x{seasonal_order}',
                    "AIC":results.aic, 
                    "BIC":results.bic,
                    "MSE":results.mse, 
                    "SSE":results.sse,
                    "RMSE": rmse_metrics(test_set['total_amount'], predictions),
                    "MAPE": mape_metrics(test_set['total_amount'], predictions) }

            summary=pd.concat([summary, pd.DataFrame(data, columns=summary.columns, index=[1])], ignore_index=True)
            
    return summary


grid_search_sarimax(train_dfex, test_dfex)
Out[66]:
Model AIC BIC MSE SSE RMSE MAPE
0 (1, 1, 1)x(0, 1, 1, 7) 10098.803428 10119.641010 8.579247e+07 4.160935e+10 16149.51 81.59
1 (1, 1, 1)x(0, 1, 2, 7) 10098.205815 10123.210914 8.590413e+07 4.166351e+10 14130.02 70.50
2 (1, 1, 2)x(0, 1, 1, 7) 10186.915064 10211.920163 8.660065e+07 4.200131e+10 13312.72 65.66
3 (1, 1, 2)x(0, 1, 2, 7) 10188.869615 10218.042231 8.654337e+07 4.197354e+10 13383.17 66.10
4 (2, 1, 1)x(0, 1, 1, 7) 10098.721890 10123.726989 8.607758e+07 4.174762e+10 13809.34 68.67
5 (2, 1, 1)x(0, 1, 2, 7) 10186.651548 10215.824164 8.653358e+07 4.196879e+10 13448.51 66.51
6 (2, 1, 2)x(0, 1, 1, 7) 10086.691009 10115.863625 8.314456e+07 4.032511e+10 16061.22 81.02
7 (2, 1, 2)x(0, 1, 2, 7) 10184.337084 10217.677216 8.371883e+07 4.060363e+10 15112.30 75.99
8 (3, 1, 1)x(0, 1, 1, 7) 10182.809279 10211.981894 8.421900e+07 4.084622e+10 14936.32 75.13
9 (3, 1, 1)x(0, 1, 2, 7) 10184.273723 10217.613855 8.391110e+07 4.069689e+10 15021.04 75.55
10 (3, 1, 2)x(0, 1, 1, 7) 10148.537015 10181.877146 8.832691e+07 4.283855e+10 16411.13 83.02
11 (3, 1, 2)x(0, 1, 2, 7) 10183.265203 10220.772851 8.332192e+07 4.041113e+10 15102.15 75.95
12 (4, 1, 1)x(0, 1, 1, 7) 10181.694694 10215.034826 8.339000e+07 4.044415e+10 15026.73 75.58
13 (4, 1, 1)x(0, 1, 2, 7) 10183.326100 10220.833749 8.316294e+07 4.033403e+10 15083.97 75.86
14 (4, 1, 2)x(0, 1, 1, 7) 10175.171123 10212.678772 8.319527e+07 4.034971e+10 15159.34 76.28
15 (4, 1, 2)x(0, 1, 2, 7) 10176.405676 10218.080841 8.288848e+07 4.020091e+10 15167.14 76.28

SARIMAX order (1, 1, 2) x (0, 1, 1, 7) gives the minimum MAPE and RMSE on the test data. We will choose this model over our baseline.
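To pick that row programmatically rather than by eye, the dataframe returned by `grid_search_sarimax` (if assigned to a variable, e.g. `summary = grid_search_sarimax(train_dfex, test_dfex)`) can be queried with `idxmin`. A minimal sketch using a trimmed-down demo frame with two rows from the table above:

```python
import pandas as pd

def best_model(summary: pd.DataFrame, metric: str = 'MAPE') -> pd.Series:
    """Return the grid-search row with the lowest value of the chosen metric."""
    return summary.loc[summary[metric].astype(float).idxmin()]

# Demo with two rows from the results table above
demo = pd.DataFrame({
    'Model': ['(1, 1, 1)x(0, 1, 1, 7)', '(1, 1, 2)x(0, 1, 1, 7)'],
    'RMSE': [16149.51, 13312.72],
    'MAPE': [81.59, 65.66],
})
print(best_model(demo)['Model'])  # (1, 1, 2)x(0, 1, 1, 7)
```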

Fitting the model on tuned SARIMAX¶

In [67]:
# Set Hyper-parameters
p, d, q = 1, 1, 2
P, D, Q = 0, 1, 1
s = 7

# Fit SARIMA
sarimax_model = SARIMAX(train_dfex['total_amount'], 
                       order=(p, d, q), 
                       seasonal_order=(P, D, Q, s), 
                       exog=(train_dfex[['is_holiday']]))
sarimax_model_fit = sarimax_model.fit(disp=0)
print(sarimax_model_fit.summary())
                                      SARIMAX Results                                      
===========================================================================================
Dep. Variable:                        total_amount   No. Observations:                  485
Model:             SARIMAX(1, 1, 2)x(0, 1, [1], 7)   Log Likelihood               -5087.458
Date:                             Wed, 15 Feb 2023   AIC                          10186.915
Time:                                     08:55:41   BIC                          10211.920
Sample:                                 01-01-2017   HQIC                         10196.747
                                      - 04-30-2018                                         
Covariance Type:                               opg                                         
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
is_holiday  4390.8251   1903.409      2.307      0.021     660.212    8121.439
ar.L1          0.0746      0.527      0.142      0.887      -0.959       1.108
ma.L1         -0.6470      0.534     -1.211      0.226      -1.694       0.400
ma.L2         -0.1379      0.409     -0.337      0.736      -0.939       0.664
ma.S.L7       -0.9281      0.039    -23.640      0.000      -1.005      -0.851
sigma2      1.719e+08      0.583   2.95e+08      0.000    1.72e+08    1.72e+08
===================================================================================
Ljung-Box (L1) (Q):                   0.00   Jarque-Bera (JB):            200524.31
Prob(Q):                              0.99   Prob(JB):                         0.00
Heteroskedasticity (H):              11.31   Skew:                             7.11
Prob(H) (two-sided):                  0.00   Kurtosis:                       102.43
===================================================================================

Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).
[2] Covariance matrix is singular or near-singular, with condition number 2.94e+22. Standard errors may be unstable.

5.6 Plotting predictions and evaluating SARIMAX model¶

Prediction using tuned SARIMAX¶

In [68]:
#predicting 
pred_start_date = test_df.index[0]
pred_end_date = test_df.index[-1]

sarimax_predictions = sarimax_model_fit.predict(start=pred_start_date, end=pred_end_date, exog=(test_dfex[['is_holiday']]) )

Evaluation of tuned SARIMAX¶

In [69]:
sarimax_root_mean_squared_error = rmse_metrics(test_dfex['total_amount'], sarimax_predictions)
sarimax_mape=mape_metrics(test_dfex['total_amount'], sarimax_predictions)

print(f'Root Mean Squared Error | RMSE: {sarimax_root_mean_squared_error}')
print(f'Mean Absolute Percentage Error | MAPE: {sarimax_mape}')
Root Mean Squared Error | RMSE: 13312.72
Mean Absolute Percentage Error | MAPE: 65.66

Plotting test and predictions of tuned SARIMAX¶

In [70]:
plot_test_predictions(test_dfex['total_amount'], sarimax_predictions)

SARIMAX Forecast¶

In [71]:
# Forecast Window
exog=holiday.loc[(holiday.index> '2018-04-30') & (holiday.index<= '2018-10-27') ]
exog_df=exog.copy()

days = 180
exog_param=(exog_df[['is_holiday']])
sarimax_forecast = sarimax_model_fit.forecast(days, exog=exog_param)
sarimax_forecast_series = pd.Series(sarimax_forecast, index=sarimax_forecast.index)
In [72]:
plot_forecast(train_dfex['total_amount'], test_dfex['total_amount'], sarimax_forecast_series)

Observations:¶

  • Although our MAPE and RMSE have been reduced after introducing holidays and tuning the orders, the model is still unable to pick up the daily variation that well.
  • It is capturing some small peaks but not very accurately.
  • We can try some other models and see how they perform.

6. Modelling (Facebook Prophet)¶

FB Prophet is a forecasting package in Python that was developed by Facebook’s data science research team. The goal of the package is to give business users a powerful and easy-to-use tool to help forecast business results without needing to be an expert in time series analysis. We will apply this model and see how it performs.

6.1 Preparing data for FB Prophet¶

Facebook Prophet needs data in a certain format to be able to process it. The input to Prophet is always a dataframe with two columns: ds and y. The ds (datestamp) column should be of a format expected by Pandas, ideally YYYY-MM-DD for a date or YYYY-MM-DD HH:MM:SS for a timestamp. The y column must be numeric and represents the measurement, in our case total_amount.

In [73]:
#preparing the dataframe for fbProphet

prophet_df=dfex['total_amount'].reset_index()
prophet_df.rename(columns={"index": "ds", "total_amount": "y"}, inplace=True)

#using our original train_df and test_df we will convert them into prophet train and test sets.
prophet_train = train_df["total_amount"].reset_index()
prophet_train.rename(columns={"order_purchase_timestamp": "ds", "total_amount": "y"}, inplace=True)
prophet_test = test_df["total_amount"].reset_index()
prophet_test.rename(columns={"order_purchase_timestamp": "ds", "total_amount": "y"}, inplace=True)
In [74]:
prophet_df.head()
Out[74]:
ds y
0 2017-01-01 0.0
1 2017-01-02 0.0
2 2017-01-03 0.0
3 2017-01-04 0.0
4 2017-01-05 396.9

6.2 Applying a Baseline FB Prophet¶

Since we observed that our data has a positive trend and seasonality, we will set growth='linear' and let the model find the appropriate seasonality by setting yearly_seasonality, daily_seasonality and weekly_seasonality to True.

In [75]:
#instantiate the model
fb_baseline = Prophet(growth='linear', 
                yearly_seasonality=True, 
                daily_seasonality=True, 
                weekly_seasonality=True)
fb_baseline.fit(prophet_train)
Out[75]:
<fbprophet.forecaster.Prophet at 0x1d492326130>

Predictions using baseline Prophet¶

In [76]:
#make predictions dataframe 
future_base = fb_baseline.make_future_dataframe(periods=len(test_df), freq="D")
In [77]:
#make a forecast
forecast_base = fb_baseline.predict(future_base)
forecast_base[['ds', 'yhat', 'yhat_lower', 'yhat_upper']].tail()
Out[77]:
ds yhat yhat_lower yhat_upper
601 2018-08-25 38703.402238 27850.514400 49974.134276
602 2018-08-26 39793.914774 28853.592346 50549.915252
603 2018-08-27 47564.058526 37425.216632 58963.497141
604 2018-08-28 48007.781107 37422.322128 58954.139575
605 2018-08-29 46813.235399 35654.962406 58255.634010
In [78]:
forecast_base[-121:].reset_index()['yhat']
Out[78]:
0      41595.478779
1      40191.984960
2      40231.082802
3      40782.308981
4      34046.678525
           ...     
116    38703.402238
117    39793.914774
118    47564.058526
119    48007.781107
120    46813.235399
Name: yhat, Length: 121, dtype: float64

6.3 Plotting and Evaluating Baseline model¶

In [79]:
#evaluating on test set
fb_baseline_mape = mape_metrics(prophet_test['y'], forecast_base[-121:].reset_index()['yhat'] )
fb_baseline_rmse = rmse_metrics(prophet_test['y'], forecast_base[-121:].reset_index()['yhat'] )

print(f'Root Mean Squared Error | RMSE: {fb_baseline_rmse}')
print(f'Mean Absolute Percentage Error | MAPE: {fb_baseline_mape}')
Root Mean Squared Error | RMSE: 14904.05
Mean Absolute Percentage Error | MAPE: 75.28

Plotting the forecast using Baseline FB Prophet¶

In [80]:
from fbprophet.plot import plot_plotly

fig = plot_plotly(fb_baseline, forecast_base) 
fig.update_layout(
    title="Daily Sales amount",
    xaxis_title="Date",
    yaxis_title="Revenue amount"
    )
fig.show()

Observations:¶

  • Although Prophet didn't give us a good MAPE or RMSE, from the plot we can see that it is able to capture seasonality, trend and some troughs.
  • It is worth exploring further by tuning the hyperparameters and including the holiday impact.

6.4 Tuning FB Prophet using Grid Search¶

Adding external variable Holiday¶

In [81]:
#preparing the holiday dataframe as per prophet
holiday_df_fb = pd.DataFrame({
  'holiday': 'Brazil holidays',
  'ds': pd.to_datetime(holiday_df.loc[holiday_df['is_holiday']==1].index)})

Adding the impact of holidays to our baseline¶

In [82]:
fb_baseline_holi = Prophet(growth='linear', 
                           holidays=holiday_df_fb, 
                           yearly_seasonality=True, 
                           daily_seasonality=True, 
                           weekly_seasonality=True).add_country_holidays(country_name='BR')

fb_baseline_holi.fit(prophet_train)
Out[82]:
<fbprophet.forecaster.Prophet at 0x1d48b492b80>
In [83]:
#preparing forecast dataframe
future_base_holi = fb_baseline_holi.make_future_dataframe(periods=len(test_df), freq="D")
In [84]:
#doing actual forecast
forecast_base_holi = fb_baseline_holi.predict(future_base_holi)
forecast_base_holi[['ds', 'yhat', 'yhat_lower', 'yhat_upper']].tail()
Out[84]:
ds yhat yhat_lower yhat_upper
601 2018-08-25 39680.385242 27493.733309 50152.633038
602 2018-08-26 40618.882951 29738.338508 51236.224444
603 2018-08-27 48606.375757 37874.992661 59519.491629
604 2018-08-28 48832.434226 37915.910928 59524.345705
605 2018-08-29 47909.521241 36820.957799 59033.048757

Evaluating the model on test data¶

In [85]:
#evaluating on test set
fb_baseline_holi_mape=mape_metrics(prophet_test['y'], forecast_base_holi[-121:].reset_index()['yhat'] )
fb_baseline_holi_rmse=rmse_metrics(prophet_test['y'], forecast_base_holi[-121:].reset_index()['yhat'] )

print(f'Root Mean Squared Error | RMSE: {fb_baseline_holi_rmse}')
print(f'Mean Absolute Percentage Error | MAPE: {fb_baseline_holi_mape}')
Root Mean Squared Error | RMSE: 15393.64
Mean Absolute Percentage Error | MAPE: 77.88

Observations:¶

  • Our baseline model actually became worse. We should tune the parameters to improve it.

Tuning FB Prophet¶

We will tune the following parameters: seasonality_mode, changepoint_prior_scale and holidays_prior_scale.

Let us discuss these parameters.

seasonality_mode: By default FB Prophet selects additive, but we want to give it the option to try multiplicative.

holidays_prior_scale: This parameter determines how much of an effect holidays have on the predictions. The default is 10, and it can be tuned over a range such as [0.01, 10].

Changepoints are points in the data where there are sudden and abrupt changes in the trend. The other parameter, changepoint_prior_scale, determines the flexibility of the trend, and in particular how much the trend changes at the changepoints. The default of 0.05 works for many time series, but it can be tuned; a range of [0.001, 0.5] would likely be about right.

Reference

In [86]:
from sklearn.model_selection import ParameterGrid
params_grid = {'seasonality_mode':('multiplicative','additive'),
               'changepoint_prior_scale':[0.1, 0.4],
               'holidays_prior_scale':[0.1,0.2,0.3,0.4,0.5]}
grid = ParameterGrid(params_grid)
cnt = 0
for p in grid:
    cnt = cnt+1

print('Total Possible Models',cnt)
Total Possible Models 20

Hyperparameter tuning of FB Prophet¶

In [87]:
tunning_summary = pd.DataFrame(columns=['Param', 'RMSE', 'MAPE'])
import random
val=pd.DataFrame()
for p in grid:
#     print(p)
    random.seed(0)
    train_model =Prophet(growth='linear',
                         changepoint_prior_scale = p['changepoint_prior_scale'],
                         holidays_prior_scale = p['holidays_prior_scale'],
                         seasonality_mode = p['seasonality_mode'],
                         seasonality_prior_scale=20,
                         weekly_seasonality=True,
                         daily_seasonality = True,
                         yearly_seasonality = True,
                         holidays=holiday_df_fb)
    train_model.add_country_holidays(country_name='BR')
    train_model.fit(prophet_train)
    grid_forecast = train_model.make_future_dataframe(periods=121, freq='D',include_history = False)
    grid_forecast = train_model.predict(grid_forecast)
    val_df=grid_forecast[['ds','yhat']]
    mapes=np.mean(np.abs((prophet_test['y']-val_df['yhat'])/prophet_test['y']))*100
    rmses = np.sqrt(np.mean((prophet_test['y']-val_df['yhat'])**2))
    # Find the best parameters
    data= {"Param": f'{p}',
           "RMSE": rmses,
           "MAPE": mapes }
    
    tunning_summary=pd.concat([tunning_summary, pd.DataFrame(data, columns=tunning_summary.columns, index=[1])], ignore_index=True)
In [88]:
tunning_summary
Out[88]:
Param RMSE MAPE
0 {'changepoint_prior_scale': 0.1, 'holidays_pri... 14323.127238 69.067258
1 {'changepoint_prior_scale': 0.1, 'holidays_pri... 13831.198684 68.977829
2 {'changepoint_prior_scale': 0.1, 'holidays_pri... 15889.884132 76.295847
3 {'changepoint_prior_scale': 0.1, 'holidays_pri... 13933.034091 69.529761
4 {'changepoint_prior_scale': 0.1, 'holidays_pri... 15910.722963 76.326488
5 {'changepoint_prior_scale': 0.1, 'holidays_pri... 13961.363212 69.692900
6 {'changepoint_prior_scale': 0.1, 'holidays_pri... 15976.943243 76.537357
7 {'changepoint_prior_scale': 0.1, 'holidays_pri... 13995.316724 69.893073
8 {'changepoint_prior_scale': 0.1, 'holidays_pri... 13712.466581 64.318053
9 {'changepoint_prior_scale': 0.1, 'holidays_pri... 14014.377104 69.986390
10 {'changepoint_prior_scale': 0.4, 'holidays_pri... 15630.137007 74.832688
11 {'changepoint_prior_scale': 0.4, 'holidays_pri... 13140.371293 64.690013
12 {'changepoint_prior_scale': 0.4, 'holidays_pri... 13052.161061 60.729572
13 {'changepoint_prior_scale': 0.4, 'holidays_pri... 13153.335184 64.719378
14 {'changepoint_prior_scale': 0.4, 'holidays_pri... 12283.063672 55.248070
15 {'changepoint_prior_scale': 0.4, 'holidays_pri... 13158.441265 64.740393
16 {'changepoint_prior_scale': 0.4, 'holidays_pri... 12151.948994 53.336248
17 {'changepoint_prior_scale': 0.4, 'holidays_pri... 13175.698072 64.846211
18 {'changepoint_prior_scale': 0.4, 'holidays_pri... 12142.685263 51.453872
19 {'changepoint_prior_scale': 0.4, 'holidays_pri... 13186.650847 64.915281
In [89]:
#filtering the row with minimum MAPE
tunning_summary.loc[tunning_summary['MAPE']==tunning_summary['MAPE'].min(),:]
Out[89]:
Param RMSE MAPE
18 {'changepoint_prior_scale': 0.4, 'holidays_pri... 12142.685263 51.453872
In [90]:
#getting the parameter values
tunning_summary.iloc[18, :]['Param']
Out[90]:
"{'changepoint_prior_scale': 0.4, 'holidays_prior_scale': 0.5, 'seasonality_mode': 'multiplicative'}"
In [91]:
#fitting the model on tuned parameters 
fb_tuned = Prophet(growth='linear',
                   changepoint_prior_scale= 0.4,
                   holidays_prior_scale= 0.5,
                   seasonality_mode= 'multiplicative',
                   seasonality_prior_scale=20,
                   holidays=holiday_df_fb,
                   yearly_seasonality=True, 
                   daily_seasonality=True, 
                   weekly_seasonality=True).add_country_holidays(country_name='BR')

fb_tuned.fit(prophet_train)
Out[91]:
<fbprophet.forecaster.Prophet at 0x1d495095cd0>
In [92]:
#make future dataframe
fb_tuned_future=fb_tuned.make_future_dataframe(periods=121, freq="D")
# ,include_history = False
In [93]:
#forecasting
forecast_tuned = fb_tuned.predict(fb_tuned_future)
In [94]:
forecast_tuned [['ds', 'yhat', 'yhat_lower', 'yhat_upper']].tail()
Out[94]:
ds yhat yhat_lower yhat_upper
601 2018-08-25 21454.618193 11376.194819 32451.621645
602 2018-08-26 22754.149194 11899.356892 32786.712348
603 2018-08-27 35314.865637 24548.772837 46298.220030
604 2018-08-28 34959.709975 24416.002477 45298.598274
605 2018-08-29 33316.214377 23193.749826 44135.647453

6.5 Plotting and evaluating Tuned FB Prophet¶

Evaluating tuned FB Prophet¶

In [95]:
#evaluating on test set
fb_tuned_mape=mape_metrics(prophet_test['y'], forecast_tuned[-121:].reset_index()['yhat'] )
fb_tuned_rmse=rmse_metrics(prophet_test['y'], forecast_tuned[-121:].reset_index()['yhat'] )

print(f'Root Mean Squared Error | RMSE: {fb_tuned_rmse}')
print(f'Mean Absolute Percentage Error | MAPE: {fb_tuned_mape}')
Root Mean Squared Error | RMSE: 12142.69
Mean Absolute Percentage Error | MAPE: 51.45
In [96]:
fig = plot_plotly(fb_tuned, forecast_tuned ) 
fig.update_layout(
    title="Daily Sales amount",
    xaxis_title="Date",
    yaxis_title="Revenue amount"
    )
fig.show()

Observations:¶

  • We have achieved much better performance with the tuned FB Prophet: a MAPE of 51.45% without much effort.
  • It is able to pick up the seasonality and has captured some peaks and troughs.
  • We will keep this model as our benchmark and explore other models like XG Boost Regressor and LSTM.
In [97]:
# prophet_test['y'].mean()
In [98]:
# forecast_tuned['yhat'][-121:].mean()

Summary:¶

Model RMSE MAPE
SARIMA(1,1,1)(0,1,1)(7) 13810.59 68.99
SARIMAX(1,1,2)(0,1,1)(7) Including impact of holidays 13312.72 65.66
Baseline Prophet 27437.77 71.78
Baseline Prophet with holiday 15393.64 77.88
Tuned Prophet with holiday 12142.69 51.45

7. Modelling (XG Boost Regression)¶

XGBoost is an efficient implementation of gradient boosting that can be used for regression predictive modeling. We will apply it here.

For XGBoost we cannot feed in the time series as such; we will have to extract datetime features from the order timestamp in order to predict 'total_amount'.

7.1 Preparing data for XG Boost Regression¶

We will create a function to extract features from the date index. We will also add holiday as a feature.

In [99]:
#function to extract features from date
def create_features(dataframe, label=None):
    df=dataframe.copy()
    df['date'] = df.index
    df['dayofweek'] = df['date'].dt.dayofweek
    df['quarter'] = df['date'].dt.quarter
    df['month'] = df['date'].dt.month
    df['year'] = df['date'].dt.year
    df['dayofyear'] = df['date'].dt.dayofyear
    df['dayofmonth'] = df['date'].dt.day
    df['weekofyear'] = df['date'].dt.isocalendar().week.astype(int)
    X = df[['dayofweek', 'quarter', 'month', 'year', 'dayofyear','dayofmonth', 'weekofyear']]
    if label:
        y = df[label]
        return X, y
    return X
In [100]:
#creating dataframe with only required columns
daily_data_xgb=dfex[['total_amount', 'is_holiday']]
In [101]:
#Separating X and y 
X, y= create_features(daily_data_xgb, label='total_amount')
X['is_holiday']=daily_data_xgb['is_holiday']
In [102]:
X.head()
Out[102]:
dayofweek quarter month year dayofyear dayofmonth weekofyear is_holiday
2017-01-01 6 1 1 2017 1 1 52 1
2017-01-02 0 1 1 2017 2 2 1 0
2017-01-03 1 1 1 2017 3 3 1 0
2017-01-04 2 1 1 2017 4 4 1 0
2017-01-05 3 1 1 2017 5 5 1 0
In [103]:
#splitting the data into train and test

train_end = '2018-4-30'
test_end = '2018-8-29'

X_train, X_test = train_test_split(X, train_end, test_end)
y_train, y_test = train_test_split(y, train_end, test_end)
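
Note that train_test_split here is the custom date-based splitter defined in section 4.1, not sklearn's function of the same name. A minimal sketch of the assumed behaviour (train up to and including train_end, test from the following day through test_end, which yields the 121-day test window used above):

```python
import pandas as pd

def train_test_split(data, train_end, test_end):
    # Date-based split for time series: rows up to and including
    # `train_end` form the train set; the day after `train_end`
    # through `test_end` forms the test set.
    train = data.loc[:train_end]
    test = data.loc[pd.Timestamp(train_end) + pd.Timedelta(days=1):test_end]
    return train, test
```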

7.2 Baseline model on daily data and its performance¶

Let us apply a baseline XG Boost Regression and we will then evaluate its performance.

In [104]:
#instantiate the XGBoost Model
reg = xgb.XGBRegressor(n_estimators=1000, max_depth=3)
reg.fit(X_train, y_train,
        eval_set=[(X_train, y_train), (X_test, y_test)],
        early_stopping_rounds=50,
       verbose=True)
[0]	validation_0-rmse:20091.22974	validation_1-rmse:25494.23244
[1]	validation_0-rmse:15569.06877	validation_1-rmse:19173.80315
[2]	validation_0-rmse:12561.60000	validation_1-rmse:15332.13670
[3]	validation_0-rmse:10761.74494	validation_1-rmse:13457.44742
[4]	validation_0-rmse:9153.27296	validation_1-rmse:12329.47857
[5]	validation_0-rmse:8018.61914	validation_1-rmse:11726.32667
[6]	validation_0-rmse:7526.28509	validation_1-rmse:11550.97415
[7]	validation_0-rmse:6865.40326	validation_1-rmse:11500.50474
[8]	validation_0-rmse:6378.20680	validation_1-rmse:11493.59336
[9]	validation_0-rmse:5999.63606	validation_1-rmse:11537.69003
[10]	validation_0-rmse:5824.62760	validation_1-rmse:11583.81280
[11]	validation_0-rmse:5589.63328	validation_1-rmse:11600.04201
[12]	validation_0-rmse:5387.86805	validation_1-rmse:11544.21935
[13]	validation_0-rmse:5348.55424	validation_1-rmse:11567.96099
[14]	validation_0-rmse:5165.61167	validation_1-rmse:11609.90550
[15]	validation_0-rmse:5020.64075	validation_1-rmse:11621.18955
[16]	validation_0-rmse:4960.49363	validation_1-rmse:11648.22786
[17]	validation_0-rmse:4917.95579	validation_1-rmse:11661.42073
[18]	validation_0-rmse:4864.95097	validation_1-rmse:11612.14894
[19]	validation_0-rmse:4766.17509	validation_1-rmse:11612.73206
[20]	validation_0-rmse:4741.14973	validation_1-rmse:11598.55214
[21]	validation_0-rmse:4665.92004	validation_1-rmse:11601.79049
[22]	validation_0-rmse:4588.43864	validation_1-rmse:11607.42071
[23]	validation_0-rmse:4518.74967	validation_1-rmse:11592.33710
[24]	validation_0-rmse:4494.93263	validation_1-rmse:11597.64204
[25]	validation_0-rmse:4365.38962	validation_1-rmse:11593.78393
[26]	validation_0-rmse:4349.07234	validation_1-rmse:11561.14553
[27]	validation_0-rmse:4297.00537	validation_1-rmse:11562.03597
[28]	validation_0-rmse:4262.70024	validation_1-rmse:11595.53519
[29]	validation_0-rmse:4208.82143	validation_1-rmse:11573.89872
[30]	validation_0-rmse:4173.03240	validation_1-rmse:11564.39727
[31]	validation_0-rmse:4136.56835	validation_1-rmse:11608.18702
[32]	validation_0-rmse:4105.81737	validation_1-rmse:11679.87197
[33]	validation_0-rmse:4040.26708	validation_1-rmse:11672.73164
[34]	validation_0-rmse:3978.29881	validation_1-rmse:11672.29589
[35]	validation_0-rmse:3944.94785	validation_1-rmse:11670.91868
[36]	validation_0-rmse:3903.84378	validation_1-rmse:11667.00166
[37]	validation_0-rmse:3826.52600	validation_1-rmse:11663.81879
[38]	validation_0-rmse:3777.20044	validation_1-rmse:11662.06124
[39]	validation_0-rmse:3736.47090	validation_1-rmse:11665.76056
[40]	validation_0-rmse:3703.70287	validation_1-rmse:11670.38346
[41]	validation_0-rmse:3651.95240	validation_1-rmse:11666.90757
[42]	validation_0-rmse:3644.95348	validation_1-rmse:11675.03502
[43]	validation_0-rmse:3628.32189	validation_1-rmse:11675.67698
[44]	validation_0-rmse:3602.89573	validation_1-rmse:11673.85435
[45]	validation_0-rmse:3579.18469	validation_1-rmse:11677.71554
[46]	validation_0-rmse:3574.58358	validation_1-rmse:11677.38926
[47]	validation_0-rmse:3550.78685	validation_1-rmse:11686.87648
[48]	validation_0-rmse:3515.18256	validation_1-rmse:11698.21647
[49]	validation_0-rmse:3501.68779	validation_1-rmse:11703.51374
[50]	validation_0-rmse:3485.54525	validation_1-rmse:11701.81938
[51]	validation_0-rmse:3475.84810	validation_1-rmse:11704.13086
[52]	validation_0-rmse:3461.98443	validation_1-rmse:11705.18792
[53]	validation_0-rmse:3432.05624	validation_1-rmse:11707.28888
[54]	validation_0-rmse:3406.63698	validation_1-rmse:11707.95335
[55]	validation_0-rmse:3403.97322	validation_1-rmse:11707.69042
[56]	validation_0-rmse:3352.98727	validation_1-rmse:11668.99701
[57]	validation_0-rmse:3336.82736	validation_1-rmse:11671.72883
[58]	validation_0-rmse:3311.60893	validation_1-rmse:11678.11015
Out[104]:
XGBRegressor(base_score=None, booster=None, callbacks=None,
             colsample_bylevel=None, colsample_bynode=None,
             colsample_bytree=None, early_stopping_rounds=None,
             enable_categorical=False, eval_metric=None, feature_types=None,
             gamma=None, gpu_id=None, grow_policy=None, importance_type=None,
             interaction_constraints=None, learning_rate=None, max_bin=None,
             max_cat_threshold=None, max_cat_to_onehot=None,
             max_delta_step=None, max_depth=3, max_leaves=None,
             min_child_weight=None, missing=nan, monotone_constraints=None,
             n_estimators=1000, n_jobs=None, num_parallel_tree=None,
             predictor=None, random_state=None, ...)

We can see how features impact the prediction of total_amount.

In [105]:
#plot feature importance
plot_importance(reg, height=0.9)
Out[105]:
<AxesSubplot: title={'center': 'Feature importance'}, xlabel='F score', ylabel='Features'>

Evaluating the performance of baseline XGB¶

In [106]:
#evaluating on test set
predictions_xgb = reg.predict(X_test)
#calculating MAPE and RMSE
mape_xgb=mape_metrics(y_test, predictions_xgb )
rmse_xgb=rmse_metrics(y_test, predictions_xgb )

print(f'Root Mean Squared Error | RMSE: {rmse_xgb}')
print(f'Mean Absolute Percentage Error | MAPE: {mape_xgb}')
Root Mean Squared Error | RMSE: 11493.59
Mean Absolute Percentage Error | MAPE: 52.18
In [107]:
#preparing future dates for forecasting
start_date=datetime(2018,8,30)
end_date=start_date+pd.DateOffset(days=30)
full_range = pd.date_range(start=start_date, end=end_date, freq="D")
filtered_data= holiday.loc[(holiday.index >= start_date) & (holiday.index <= end_date)]
forecast_df=pd.DataFrame(data=filtered_data, index=full_range, columns=['is_holiday'])
In [108]:
#creating features for new dates
X_forcast=create_features(forecast_df)
X_forcast['is_holiday']=forecast_df['is_holiday']
In [109]:
#making forecast
xgb_forecast=reg.predict(X_forcast)
In [110]:
#creating dataframe for all the y values
y_train_df = pd.DataFrame(y_train, train_df.index, columns=['total_amount'] )
y_test_df = pd.DataFrame(y_test, test_df.index, columns=['total_amount'] )
xgb_forecast_df=pd.DataFrame(xgb_forecast, forecast_df.index, columns=['Forecast'] )
In [111]:
# predictions_xgb_df=pd.DataFrame(predictions_xgb, y_test_df.index, columns=['Predictions'] )
In [112]:
#plotting test and predictions
plot_test_predictions(y_test_df['total_amount'], predictions_xgb  )
In [113]:
#plotting forecast
plot_forecast(y_train_df['total_amount'], y_test_df['total_amount'], xgb_forecast_df['Forecast'] )

Observations:¶

  • Upon plotting both test and predicted values, we can see that the predicted series is only picking up the seasonality but not the pattern within the week.
  • It looks like it has a constant trend.
  • Although it is not capturing the noise, I think it is not performing well because of the limited number of datapoints.

7.3 Tuning XG Boost and evaluating and its performance¶

Early stopping is a technique used to stop training when the loss on the validation dataset starts to increase (in the case of minimizing the loss). That is why training a model (any model, not only XGBoost) requires two separate datasets:

  • training data for model fitting,
  • validation data for loss monitoring and early stopping.
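
The mechanism can be illustrated with a small, library-agnostic sketch (the early_stopping helper and the loss values below are hypothetical illustrations, not XGBoost internals; XGBoost applies the same idea via its early_stopping_rounds argument):

```python
def early_stopping(val_losses, patience):
    """Return the number of rounds actually trained when training stops
    after `patience` consecutive rounds without a new best validation loss."""
    best = float("inf")
    rounds_since_best = 0
    for i, loss in enumerate(val_losses, start=1):
        if loss < best:
            best, rounds_since_best = loss, 0
        else:
            rounds_since_best += 1
            if rounds_since_best >= patience:
                return i
    return len(val_losses)

# Validation loss improves until round 3, then degrades; with patience=2
# training halts at round 5 instead of running all 6 rounds.
print(early_stopping([10.0, 8.0, 7.5, 7.9, 8.3, 8.8], patience=2))  # prints 5
```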

In [114]:
num_estimators= [i for i in range(100, 1010, 100)]
depth=[i for i in range(3, 15, 2)]
learning=[0.0001, 0.001, 0.01, 0.1, 0.2, 0.3]
In [115]:
XGB_summary=pd.DataFrame(columns=['n_estimators', 'max_depth', 'learning_rate', 'MAPE', 'RMSE' ])

for i in num_estimators:
    for j in depth:
        for k in learning:
            model_reg= xgb.XGBRegressor(n_estimators=i, max_depth=j, learning_rate=k )
            model_reg.fit(X_train, y_train)
            predictions_xgb = model_reg.predict(X_test)
            mape_xgb=np.mean(np.abs((y_test - predictions_xgb)/y_test))*100
            rmse_xgb = np.sqrt(np.mean((y_test-predictions_xgb)**2))
            data_param={'n_estimators': i,
                        'max_depth': j,
                        'learning_rate': k,
                        'MAPE': mape_xgb,
                        'RMSE' : rmse_xgb }
            XGB_summary=pd.concat([XGB_summary, pd.DataFrame(data_param, columns=XGB_summary.columns, index=[1])], ignore_index=True)
In [116]:
XGB_summary.sort_values(by ='MAPE', ascending=True).head(10)
Out[116]:
n_estimators max_depth learning_rate MAPE RMSE
62 200 11 0.010 47.583272 12349.605940
56 200 9 0.010 47.687938 12271.166307
68 200 13 0.010 47.884805 12237.428437
50 200 7 0.010 48.130152 11923.555634
38 200 3 0.010 48.907972 12070.128655
44 200 5 0.010 49.357951 12115.267006
2 100 3 0.010 50.211866 16374.092326
325 1000 3 0.001 50.265213 16415.906460
98 300 11 0.010 50.517951 11576.892123
20 100 9 0.010 50.592040 16444.078331
In [117]:
#instantiate the XGBoost Model
reg_tuned = xgb.XGBRegressor(n_estimators=200, max_depth=11, learning_rate=0.010)
reg_tuned.fit(X_train, y_train,
        eval_set=[(X_train, y_train), (X_test, y_test)],
        early_stopping_rounds=50,
       verbose=True)
[0]	validation_0-rmse:26442.40689	validation_1-rmse:34787.93310
[1]	validation_0-rmse:26209.35688	validation_1-rmse:34463.03673
[2]	validation_0-rmse:25978.79956	validation_1-rmse:34141.76996
[3]	validation_0-rmse:25750.58389	validation_1-rmse:33823.97676
[4]	validation_0-rmse:25524.78552	validation_1-rmse:33509.75442
[5]	validation_0-rmse:25301.16760	validation_1-rmse:33198.94082
[6]	validation_0-rmse:25079.77943	validation_1-rmse:32891.58606
[7]	validation_0-rmse:24860.69992	validation_1-rmse:32587.67999
[8]	validation_0-rmse:24643.77319	validation_1-rmse:32287.11881
[9]	validation_0-rmse:24429.09747	validation_1-rmse:31989.95224
[10]	validation_0-rmse:24216.54912	validation_1-rmse:31696.06386
[11]	validation_0-rmse:24006.18467	validation_1-rmse:31405.48561
[12]	validation_0-rmse:23797.99852	validation_1-rmse:31118.18460
[13]	validation_0-rmse:23591.86642	validation_1-rmse:30834.10295
[14]	validation_0-rmse:23387.87887	validation_1-rmse:30553.24785
[15]	validation_0-rmse:23185.99936	validation_1-rmse:30275.55099
[16]	validation_0-rmse:22986.31690	validation_1-rmse:30001.02589
[17]	validation_0-rmse:22788.61823	validation_1-rmse:29729.60046
[18]	validation_0-rmse:22592.93534	validation_1-rmse:29461.28626
[19]	validation_0-rmse:22399.44787	validation_1-rmse:29196.03476
[20]	validation_0-rmse:22207.98835	validation_1-rmse:28933.82858
[21]	validation_0-rmse:22018.54822	validation_1-rmse:28674.63562
[22]	validation_0-rmse:21831.03774	validation_1-rmse:28417.34174
[23]	validation_0-rmse:21645.51591	validation_1-rmse:28163.03740
[24]	validation_0-rmse:21461.91231	validation_1-rmse:27912.70522
[25]	validation_0-rmse:21280.00500	validation_1-rmse:27664.26236
[26]	validation_0-rmse:21100.10934	validation_1-rmse:27419.72236
[27]	validation_0-rmse:20921.87957	validation_1-rmse:27177.03914
[28]	validation_0-rmse:20745.52585	validation_1-rmse:26938.19038
[29]	validation_0-rmse:20570.94945	validation_1-rmse:26701.16711
[30]	validation_0-rmse:20398.04795	validation_1-rmse:26466.95706
[31]	validation_0-rmse:20227.06875	validation_1-rmse:26236.44456
[32]	validation_0-rmse:20057.78653	validation_1-rmse:26007.74784
[33]	validation_0-rmse:19890.32705	validation_1-rmse:25781.16257
[34]	validation_0-rmse:19724.31002	validation_1-rmse:25557.86221
[35]	validation_0-rmse:19560.00268	validation_1-rmse:25338.17908
[36]	validation_0-rmse:19397.62813	validation_1-rmse:25119.55491
[37]	validation_0-rmse:19236.92461	validation_1-rmse:24904.19614
[38]	validation_0-rmse:19077.88207	validation_1-rmse:24690.78116
[39]	validation_0-rmse:18919.92530	validation_1-rmse:24480.58844
[40]	validation_0-rmse:18763.63192	validation_1-rmse:24272.29462
[41]	validation_0-rmse:18608.88879	validation_1-rmse:24067.18176
[42]	validation_0-rmse:18455.49952	validation_1-rmse:23864.60373
[43]	validation_0-rmse:18304.13079	validation_1-rmse:23663.79853
[44]	validation_0-rmse:18153.55303	validation_1-rmse:23466.16882
[45]	validation_0-rmse:18004.90437	validation_1-rmse:23270.26841
[46]	validation_0-rmse:17857.70871	validation_1-rmse:23077.50240
[47]	validation_0-rmse:17711.82498	validation_1-rmse:22888.13969
[48]	validation_0-rmse:17567.59839	validation_1-rmse:22700.15545
[49]	validation_0-rmse:17424.80253	validation_1-rmse:22550.82469
[50]	validation_0-rmse:17283.21041	validation_1-rmse:22368.35618
[51]	validation_0-rmse:17143.25920	validation_1-rmse:22187.30562
[52]	validation_0-rmse:17004.70458	validation_1-rmse:22009.43614
[53]	validation_0-rmse:16867.36499	validation_1-rmse:21867.75564
[54]	validation_0-rmse:16731.70784	validation_1-rmse:21727.95619
[55]	validation_0-rmse:16596.89751	validation_1-rmse:21554.90482
[56]	validation_0-rmse:16463.78455	validation_1-rmse:21417.84817
[57]	validation_0-rmse:16331.65076	validation_1-rmse:21250.72184
[58]	validation_0-rmse:16201.02451	validation_1-rmse:21085.01305
[59]	validation_0-rmse:16071.52527	validation_1-rmse:20922.20417
[60]	validation_0-rmse:15943.50136	validation_1-rmse:20792.73119
[61]	validation_0-rmse:15816.82150	validation_1-rmse:20665.01793
[62]	validation_0-rmse:15691.31325	validation_1-rmse:20506.72100
[63]	validation_0-rmse:15567.04102	validation_1-rmse:20381.83116
[64]	validation_0-rmse:15443.99639	validation_1-rmse:20227.56692
[65]	validation_0-rmse:15321.85240	validation_1-rmse:20106.89217
[66]	validation_0-rmse:15201.10348	validation_1-rmse:19958.11693
[67]	validation_0-rmse:15081.59753	validation_1-rmse:19839.91768
[68]	validation_0-rmse:14963.16697	validation_1-rmse:19723.37067
[69]	validation_0-rmse:14846.09879	validation_1-rmse:19580.30464
[70]	validation_0-rmse:14730.26958	validation_1-rmse:19467.76560
[71]	validation_0-rmse:14615.08568	validation_1-rmse:19327.38623
[72]	validation_0-rmse:14501.21882	validation_1-rmse:19218.49154
[73]	validation_0-rmse:14388.36989	validation_1-rmse:19109.88533
[74]	validation_0-rmse:14276.69872	validation_1-rmse:18976.00033
[75]	validation_0-rmse:14166.19503	validation_1-rmse:18870.46572
[76]	validation_0-rmse:14056.80944	validation_1-rmse:18738.45732
[77]	validation_0-rmse:13948.65031	validation_1-rmse:18609.93224
[78]	validation_0-rmse:13841.30553	validation_1-rmse:18509.66669
[79]	validation_0-rmse:13734.98449	validation_1-rmse:18384.60054
[80]	validation_0-rmse:13629.57902	validation_1-rmse:18287.22974
[81]	validation_0-rmse:13525.27901	validation_1-rmse:18164.63157
[82]	validation_0-rmse:13421.92806	validation_1-rmse:18070.52467
[83]	validation_0-rmse:13319.52465	validation_1-rmse:17977.02946
[84]	validation_0-rmse:13218.55327	validation_1-rmse:17859.24701
[85]	validation_0-rmse:13118.26889	validation_1-rmse:17768.83911
[86]	validation_0-rmse:13018.95249	validation_1-rmse:17655.67929
[87]	validation_0-rmse:12920.64227	validation_1-rmse:17567.62619
[88]	validation_0-rmse:12823.30537	validation_1-rmse:17481.21680
[89]	validation_0-rmse:12726.91391	validation_1-rmse:17371.29177
[90]	validation_0-rmse:12631.37714	validation_1-rmse:17281.04926
[91]	validation_0-rmse:12536.85551	validation_1-rmse:17174.26383
[92]	validation_0-rmse:12443.29047	validation_1-rmse:17093.05868
[93]	validation_0-rmse:12350.15144	validation_1-rmse:17006.87951
[94]	validation_0-rmse:12258.09208	validation_1-rmse:16903.27040
[95]	validation_0-rmse:12166.96999	validation_1-rmse:16819.76365
[96]	validation_0-rmse:12076.75682	validation_1-rmse:16721.71515
[97]	validation_0-rmse:11987.44686	validation_1-rmse:16640.81315
[98]	validation_0-rmse:11899.03491	validation_1-rmse:16539.02730
[99]	validation_0-rmse:11811.05994	validation_1-rmse:16444.07839
[100]	validation_0-rmse:11723.94070	validation_1-rmse:16363.08833
[101]	validation_0-rmse:11637.72846	validation_1-rmse:16287.77211
[102]	validation_0-rmse:11552.37156	validation_1-rmse:16196.95429
[103]	validation_0-rmse:11467.99980	validation_1-rmse:16119.64984
[104]	validation_0-rmse:11384.36999	validation_1-rmse:16025.07425
[105]	validation_0-rmse:11301.44082	validation_1-rmse:15950.26797
[106]	validation_0-rmse:11219.43078	validation_1-rmse:15880.50572
[107]	validation_0-rmse:11138.22010	validation_1-rmse:15796.29055
[108]	validation_0-rmse:11057.61760	validation_1-rmse:15725.09281
[109]	validation_0-rmse:10977.98371	validation_1-rmse:15659.01188
[110]	validation_0-rmse:10898.70991	validation_1-rmse:15581.78830
[111]	validation_0-rmse:10820.39491	validation_1-rmse:15514.06740
[112]	validation_0-rmse:10742.65640	validation_1-rmse:15440.82507
[113]	validation_0-rmse:10666.17956	validation_1-rmse:15382.42824
[114]	validation_0-rmse:10589.39176	validation_1-rmse:15355.46163
[115]	validation_0-rmse:10513.52326	validation_1-rmse:15285.41233
[116]	validation_0-rmse:10438.25301	validation_1-rmse:15259.22429
[117]	validation_0-rmse:10363.90123	validation_1-rmse:15218.76998
[118]	validation_0-rmse:10290.95735	validation_1-rmse:15145.17038
[119]	validation_0-rmse:10217.63619	validation_1-rmse:15120.02686
[120]	validation_0-rmse:10144.96168	validation_1-rmse:15081.18798
[121]	validation_0-rmse:10073.02868	validation_1-rmse:15056.57990
[122]	validation_0-rmse:10002.05458	validation_1-rmse:14992.92486
[123]	validation_0-rmse:9931.40340	validation_1-rmse:14969.03352
[124]	validation_0-rmse:9861.74419	validation_1-rmse:14932.18266
[125]	validation_0-rmse:9792.59475	validation_1-rmse:14872.02306
[126]	validation_0-rmse:9723.93520	validation_1-rmse:14849.13555
[127]	validation_0-rmse:9656.96078	validation_1-rmse:14784.10380
[128]	validation_0-rmse:9590.41285	validation_1-rmse:14749.39955
[129]	validation_0-rmse:9523.40954	validation_1-rmse:14727.43322
[130]	validation_0-rmse:9457.36124	validation_1-rmse:14705.22301
[131]	validation_0-rmse:9392.41349	validation_1-rmse:14643.43824
[132]	validation_0-rmse:9327.36778	validation_1-rmse:14610.02508
[133]	validation_0-rmse:9262.75457	validation_1-rmse:14588.66120
[134]	validation_0-rmse:9199.92630	validation_1-rmse:14529.46199
[135]	validation_0-rmse:9136.54004	validation_1-rmse:14508.65367
[136]	validation_0-rmse:9073.77204	validation_1-rmse:14488.00872
[137]	validation_0-rmse:9011.66312	validation_1-rmse:14456.52299
[138]	validation_0-rmse:8950.63522	validation_1-rmse:14388.22913
[139]	validation_0-rmse:8889.89936	validation_1-rmse:14368.44752
[140]	validation_0-rmse:8829.48412	validation_1-rmse:14348.82200
[141]	validation_0-rmse:8769.64957	validation_1-rmse:14329.35112
[142]	validation_0-rmse:8710.51041	validation_1-rmse:14299.73420
[143]	validation_0-rmse:8652.33916	validation_1-rmse:14236.87844
[144]	validation_0-rmse:8594.14828	validation_1-rmse:14218.19770
[145]	validation_0-rmse:8536.16948	validation_1-rmse:14199.50159
[146]	validation_0-rmse:8478.78551	validation_1-rmse:14171.19810
[147]	validation_0-rmse:8422.53226	validation_1-rmse:14111.05759
[148]	validation_0-rmse:8366.80395	validation_1-rmse:14052.01183
[149]	validation_0-rmse:8311.67538	validation_1-rmse:13994.04734
[150]	validation_0-rmse:8257.05180	validation_1-rmse:13937.15073
[151]	validation_0-rmse:8202.28618	validation_1-rmse:13911.84688
[152]	validation_0-rmse:8147.90859	validation_1-rmse:13860.61095
[153]	validation_0-rmse:8094.59299	validation_1-rmse:13806.14647
[154]	validation_0-rmse:8041.49040	validation_1-rmse:13752.70445
[155]	validation_0-rmse:7989.09215	validation_1-rmse:13700.27097
[156]	validation_0-rmse:7936.08901	validation_1-rmse:13648.83326
[157]	validation_0-rmse:7883.43304	validation_1-rmse:13597.58855
[158]	validation_0-rmse:7831.96698	validation_1-rmse:13547.32636
[159]	validation_0-rmse:7780.57117	validation_1-rmse:13509.75361
[160]	validation_0-rmse:7729.44356	validation_1-rmse:13461.16210
[161]	validation_0-rmse:7679.26502	validation_1-rmse:13417.57930
[162]	validation_0-rmse:7629.94661	validation_1-rmse:13370.65577
[163]	validation_0-rmse:7579.86141	validation_1-rmse:13324.65789
[164]	validation_0-rmse:7530.28642	validation_1-rmse:13279.57262
[165]	validation_0-rmse:7481.88049	validation_1-rmse:13238.39473
[166]	validation_0-rmse:7434.37414	validation_1-rmse:13194.90050
[167]	validation_0-rmse:7385.94608	validation_1-rmse:13152.28461
[168]	validation_0-rmse:7337.90167	validation_1-rmse:13120.30512
[169]	validation_0-rmse:7290.29101	validation_1-rmse:13088.88966
[170]	validation_0-rmse:7243.30999	validation_1-rmse:13048.37841
[171]	validation_0-rmse:7197.05635	validation_1-rmse:13009.36187
[172]	validation_0-rmse:7150.78157	validation_1-rmse:12977.69068
[173]	validation_0-rmse:7104.66936	validation_1-rmse:12953.15663
[174]	validation_0-rmse:7058.99607	validation_1-rmse:12929.00534
[175]	validation_0-rmse:7014.27894	validation_1-rmse:12891.75835
[176]	validation_0-rmse:6969.33258	validation_1-rmse:12874.22243
[177]	validation_0-rmse:6924.65448	validation_1-rmse:12850.70751
[178]	validation_0-rmse:6881.04893	validation_1-rmse:12815.17283
[179]	validation_0-rmse:6836.81394	validation_1-rmse:12791.26579
[180]	validation_0-rmse:6793.07659	validation_1-rmse:12774.96141
[181]	validation_0-rmse:6750.38393	validation_1-rmse:12741.07349
[182]	validation_0-rmse:6707.18895	validation_1-rmse:12725.32849
[183]	validation_0-rmse:6665.25414	validation_1-rmse:12692.58440
[184]	validation_0-rmse:6622.79948	validation_1-rmse:12670.40666
[185]	validation_0-rmse:6580.84120	validation_1-rmse:12649.82559
[186]	validation_0-rmse:6538.95425	validation_1-rmse:12635.21367
[187]	validation_0-rmse:6498.38443	validation_1-rmse:12604.43963
[188]	validation_0-rmse:6457.16614	validation_1-rmse:12583.59020
[189]	validation_0-rmse:6417.14080	validation_1-rmse:12554.43821
[190]	validation_0-rmse:6376.47505	validation_1-rmse:12540.82844
[191]	validation_0-rmse:6336.70049	validation_1-rmse:12512.16896
[192]	validation_0-rmse:6297.19847	validation_1-rmse:12487.46011
[193]	validation_0-rmse:6257.52845	validation_1-rmse:12476.93487
[194]	validation_0-rmse:6218.25615	validation_1-rmse:12466.67714
[195]	validation_0-rmse:6179.37799	validation_1-rmse:12441.48705
[196]	validation_0-rmse:6141.09577	validation_1-rmse:12415.26651
[197]	validation_0-rmse:6103.26060	validation_1-rmse:12397.83363
[198]	validation_0-rmse:6065.61090	validation_1-rmse:12373.72518
[199]	validation_0-rmse:6029.12181	validation_1-rmse:12349.60594
Out[117]:
XGBRegressor(base_score=None, booster=None, callbacks=None,
             colsample_bylevel=None, colsample_bynode=None,
             colsample_bytree=None, early_stopping_rounds=None,
             enable_categorical=False, eval_metric=None, feature_types=None,
             gamma=None, gpu_id=None, grow_policy=None, importance_type=None,
             interaction_constraints=None, learning_rate=0.01, max_bin=None,
             max_cat_threshold=None, max_cat_to_onehot=None,
             max_delta_step=None, max_depth=11, max_leaves=None,
             min_child_weight=None, missing=nan, monotone_constraints=None,
             n_estimators=200, n_jobs=None, num_parallel_tree=None,
             predictor=None, random_state=None, ...)
In [118]:
#evaluating the tuned model on the test set
predictions_xgb_tuned = reg_tuned.predict(X_test)
#calculating MAPE and RMSE
mape_xgb_tuned=mape_metrics(y_test, predictions_xgb_tuned )
rmse_xgb_tuned=rmse_metrics(y_test, predictions_xgb_tuned )

print(f'Root Mean Squared Error | RMSE: {rmse_xgb_tuned}')
print(f'Mean Absolute Percentage Error | MAPE: {mape_xgb_tuned}')
Root Mean Squared Error | RMSE: 12349.61
Mean Absolute Percentage Error | MAPE: 47.58
In [119]:
#plot feature importance
plot_importance(reg_tuned, height=0.9)
Out[119]:
<AxesSubplot: title={'center': 'Feature importance'}, xlabel='F score', ylabel='Features'>
In [120]:
#plotting test and predictions
plot_test_predictions(y_test_df['total_amount'], predictions_xgb_tuned)

Observations:¶

  • After tuning, the model is able to capture the peaks within each week.
  • The MAPE obtained is very good, but I am still reluctant to choose this as the best model: visual inspection shows it is not capturing any trend, and its RMSE is higher than the baseline XGBoost's.
  • Also, since we have limited data, I assume the model is not learning enough.
  • Regression trees cannot predict values they have not seen in the training data, which needs to be considered when there is a special sale period.
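The last point is worth a quick illustration. The sketch below uses sklearn's DecisionTreeRegressor as a stand-in for XGBoost's trees (both partition the feature space into leaves, so neither can extrapolate): predictions outside the training range flatten at the value of the nearest leaf.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Train on a simple upward trend: y = 10 * x, for x in 0..49
X_train = np.arange(0, 50).reshape(-1, 1)
y_train = 10 * X_train.ravel()

tree = DecisionTreeRegressor(random_state=0).fit(X_train, y_train)

# Predict beyond the training range: a tree can only return values
# seen in training, so the "forecast" is capped at the last leaf (490).
X_future = np.array([[60], [100]])
print(tree.predict(X_future))  # [490. 490.]
```

This is why a tree-based model would underestimate sales during an unprecedented sale period, no matter how far the real trend climbs.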

8. Modelling (LSTM)¶

We can use a univariate LSTM to predict the next value in our time series. We will use one-step prediction, meaning we use only one previous observation to predict the next value, so we have a single feature. Such a one-step window is most useful when we only need to forecast the next day's value. That is not quite the case for our e-commerce sales forecasting, but we still want to try this approach.

Like the previous SARIMA, FB Prophet and XGBoost models, LSTM expects its input in a specific format: an array of shape (number of observations, number of timesteps, number of features).

8.1 Preparing data for LSTM¶

We need to scale the data in order to apply it to LSTM model.

In [121]:
#scaling the data
sc = MinMaxScaler(feature_range = (0,1))
training_scaled = sc.fit_transform(train_df)
test_scaled = sc.transform(test_df)
fig, (ax0,ax1) = plt.subplots(nrows=1, ncols=2, figsize=(20,8))
ax0.set_title('Original Distributions')
ax1.set_title('Distributions after MinMax scaler')
ax0.plot(train_df['total_amount'])
ax1.plot(pd.DataFrame(training_scaled,columns=['total_amount'], index=train_df.index)['total_amount'])
Out[121]:
[<matplotlib.lines.Line2D at 0x1d49266df40>]
In [122]:
#define the lookback function for creating y

def lookback(df, window):
    X = list()
    Y = list()

    for i in range(window,len(df)):
        X.append(df[i-window:i, 0])
        Y.append(df[i,0])
    
    X,Y = np.array(X),np.array(Y)
    return X, Y
In [123]:
#creating X and y using lookback function
X_train_lstm, y_train_lstm = lookback(training_scaled, 1)
X_test_lstm, y_test_lstm = lookback(test_scaled, 1)
In [124]:
#checking shape of X
X_train_lstm.shape
Out[124]:
(484, 1)
In [125]:
#reshaping X for LSTM
#number of observations, timesteps and features
X_train_lstm = np.reshape(X_train_lstm, (X_train_lstm.shape[0],X_train_lstm.shape[1],1))
X_test_lstm = np.reshape(X_test_lstm, (X_test_lstm.shape[0],X_test_lstm.shape[1],1))
In [126]:
X_train_lstm.shape
Out[126]:
(484, 1, 1)

8.2 Baseline model on daily data and its performance¶

Let us apply an LSTM model with three layers: the first layer has 100 nodes, the second has 50 nodes and the output layer has 1 node. We use Keras' EarlyStopping callback to prevent overfitting, and a batch size of 16.

In [127]:
# random seeds for reproducibility
np.random.seed(123)
tf.random.set_seed(123)

model = Sequential()
# The input of LSTM layer has a shape of (num_timesteps, num_features)
model.add(LSTM(100, return_sequences=True, activation='relu', input_shape=(X_train_lstm.shape[1], X_train_lstm.shape[2])))
model.add(Dropout(0.2))

model.add(LSTM(50, activation='relu' ))
model.add(Dropout(0.2))
          
model.add(Dense(1))

model.compile(loss='mean_squared_error', optimizer='adam')

model.summary()

history = model.fit(X_train_lstm, y_train_lstm, epochs=50, batch_size=16, shuffle=False,
                    validation_data=(X_test_lstm, y_test_lstm), verbose=1,
                    callbacks = [EarlyStopping(monitor='val_loss', mode='min', patience=20, verbose=1)])
Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 lstm (LSTM)                 (None, 1, 100)            40800     
                                                                 
 dropout (Dropout)           (None, 1, 100)            0         
                                                                 
 lstm_1 (LSTM)               (None, 50)                30200     
                                                                 
 dropout_1 (Dropout)         (None, 50)                0         
                                                                 
 dense (Dense)               (None, 1)                 51        
                                                                 
=================================================================
Total params: 71,051
Trainable params: 71,051
Non-trainable params: 0
_________________________________________________________________
Epoch 1/50
31/31 [==============================] - 5s 28ms/step - loss: 0.0103 - val_loss: 0.0101
Epoch 2/50
31/31 [==============================] - 0s 8ms/step - loss: 0.0054 - val_loss: 0.0056
Epoch 3/50
31/31 [==============================] - 0s 6ms/step - loss: 0.0065 - val_loss: 0.0059
Epoch 4/50
31/31 [==============================] - 0s 6ms/step - loss: 0.0062 - val_loss: 0.0058
Epoch 5/50
31/31 [==============================] - 0s 6ms/step - loss: 0.0061 - val_loss: 0.0055
Epoch 6/50
31/31 [==============================] - 0s 6ms/step - loss: 0.0059 - val_loss: 0.0053
Epoch 7/50
31/31 [==============================] - 0s 6ms/step - loss: 0.0058 - val_loss: 0.0050
Epoch 8/50
31/31 [==============================] - 0s 6ms/step - loss: 0.0058 - val_loss: 0.0047
Epoch 9/50
31/31 [==============================] - 0s 6ms/step - loss: 0.0055 - val_loss: 0.0044
Epoch 10/50
31/31 [==============================] - 0s 6ms/step - loss: 0.0054 - val_loss: 0.0041
Epoch 11/50
31/31 [==============================] - 0s 6ms/step - loss: 0.0051 - val_loss: 0.0038
Epoch 12/50
31/31 [==============================] - 0s 6ms/step - loss: 0.0051 - val_loss: 0.0037
Epoch 13/50
31/31 [==============================] - 0s 6ms/step - loss: 0.0047 - val_loss: 0.0033
Epoch 14/50
31/31 [==============================] - 0s 6ms/step - loss: 0.0047 - val_loss: 0.0031
Epoch 15/50
31/31 [==============================] - 0s 6ms/step - loss: 0.0045 - val_loss: 0.0029
Epoch 16/50
31/31 [==============================] - 0s 6ms/step - loss: 0.0045 - val_loss: 0.0030
Epoch 17/50
31/31 [==============================] - 0s 6ms/step - loss: 0.0043 - val_loss: 0.0028
Epoch 18/50
31/31 [==============================] - 0s 6ms/step - loss: 0.0043 - val_loss: 0.0028
Epoch 19/50
31/31 [==============================] - 0s 6ms/step - loss: 0.0042 - val_loss: 0.0027
Epoch 20/50
31/31 [==============================] - 0s 6ms/step - loss: 0.0041 - val_loss: 0.0026
Epoch 21/50
31/31 [==============================] - 0s 6ms/step - loss: 0.0039 - val_loss: 0.0026
Epoch 22/50
31/31 [==============================] - 0s 6ms/step - loss: 0.0044 - val_loss: 0.0028
Epoch 23/50
31/31 [==============================] - 0s 6ms/step - loss: 0.0037 - val_loss: 0.0025
Epoch 24/50
31/31 [==============================] - 0s 8ms/step - loss: 0.0037 - val_loss: 0.0025
Epoch 25/50
31/31 [==============================] - 0s 6ms/step - loss: 0.0041 - val_loss: 0.0026
Epoch 26/50
31/31 [==============================] - 0s 6ms/step - loss: 0.0037 - val_loss: 0.0025
Epoch 27/50
31/31 [==============================] - 0s 7ms/step - loss: 0.0037 - val_loss: 0.0024
Epoch 28/50
31/31 [==============================] - 0s 6ms/step - loss: 0.0040 - val_loss: 0.0024
Epoch 29/50
31/31 [==============================] - 0s 6ms/step - loss: 0.0037 - val_loss: 0.0024
Epoch 30/50
31/31 [==============================] - 0s 6ms/step - loss: 0.0035 - val_loss: 0.0024
Epoch 31/50
31/31 [==============================] - 0s 6ms/step - loss: 0.0035 - val_loss: 0.0024
Epoch 32/50
31/31 [==============================] - 0s 6ms/step - loss: 0.0038 - val_loss: 0.0024
Epoch 33/50
31/31 [==============================] - 0s 6ms/step - loss: 0.0033 - val_loss: 0.0024
Epoch 34/50
31/31 [==============================] - 0s 6ms/step - loss: 0.0033 - val_loss: 0.0023
Epoch 35/50
31/31 [==============================] - 0s 6ms/step - loss: 0.0035 - val_loss: 0.0024
Epoch 36/50
31/31 [==============================] - 0s 6ms/step - loss: 0.0033 - val_loss: 0.0023
Epoch 37/50
31/31 [==============================] - 0s 6ms/step - loss: 0.0033 - val_loss: 0.0023
Epoch 38/50
31/31 [==============================] - 0s 5ms/step - loss: 0.0033 - val_loss: 0.0023
Epoch 39/50
31/31 [==============================] - 0s 6ms/step - loss: 0.0032 - val_loss: 0.0023
Epoch 40/50
31/31 [==============================] - 0s 6ms/step - loss: 0.0032 - val_loss: 0.0023
Epoch 41/50
31/31 [==============================] - 0s 6ms/step - loss: 0.0030 - val_loss: 0.0023
Epoch 42/50
31/31 [==============================] - 0s 6ms/step - loss: 0.0033 - val_loss: 0.0023
Epoch 43/50
31/31 [==============================] - 0s 6ms/step - loss: 0.0032 - val_loss: 0.0023
Epoch 44/50
31/31 [==============================] - 0s 6ms/step - loss: 0.0030 - val_loss: 0.0023
Epoch 45/50
31/31 [==============================] - 0s 6ms/step - loss: 0.0031 - val_loss: 0.0023
Epoch 46/50
31/31 [==============================] - 0s 6ms/step - loss: 0.0033 - val_loss: 0.0023
Epoch 47/50
31/31 [==============================] - 0s 6ms/step - loss: 0.0032 - val_loss: 0.0023
Epoch 48/50
31/31 [==============================] - 0s 6ms/step - loss: 0.0029 - val_loss: 0.0024
Epoch 49/50
31/31 [==============================] - 0s 6ms/step - loss: 0.0034 - val_loss: 0.0023
Epoch 50/50
31/31 [==============================] - 0s 6ms/step - loss: 0.0033 - val_loss: 0.0023
In [128]:
#calculating predictions
prediction_lstm = model.predict(X_test_lstm)
4/4 [==============================] - 0s 3ms/step

Predictions and evaluation using LSTM¶

In [129]:
#inverse-transforming the predictions back to the original scale

prediction_inverse = sc.inverse_transform(prediction_lstm.reshape(-1, 1))
y_test_inverse = sc.inverse_transform(y_test_lstm.reshape(-1, 1))

prediction2_inverse = np.array(prediction_inverse[:,0][1:])
y_test2_inverse = np.array(y_test_inverse[:,0])
In [130]:
#Evaluating the LSTM model.

y_test2_inverse_without_last = y_test2_inverse[:-1]
rmse_lstm = rmse_metrics(y_test2_inverse_without_last, prediction2_inverse)
mape_lstm = mape_metrics(y_test2_inverse_without_last, prediction2_inverse)
print(f"Root Mean Squared Error | RMSE: {rmse_lstm}")
print(f"Mean Absolute Percentage Error | MAPE: {mape_lstm}")
Root Mean Squared Error | RMSE: 3638.94
Mean Absolute Percentage Error | MAPE: 9.94
In [131]:
#plotting performance of the model
epochs = range(1, 51)

plt.figure(figsize=(15, 5))
plt.plot(epochs, history.history["loss"], label="training", marker="o")
plt.plot(epochs, history.history["val_loss"], label="validation", marker="o")
plt.xlabel("Epochs")
plt.xticks(epochs)
plt.ylabel("Loss")
plt.legend()
plt.show()

Looking at the training metrics, we see that the validation and training loss decrease in parallel (so we are not overfitting):

  • The plot of training loss decreases to a point of stability.
  • The plot of validation loss decreases to a point of stability and has a small gap with the training loss.

Plotting the test and prediction values¶

In [132]:
y_test2_inverse_without_last = pd.DataFrame(y_test2_inverse_without_last, columns=['total_amount'], index=test_df[: -2].index)
In [133]:
plot_test_predictions(y_test2_inverse_without_last['total_amount'], prediction2_inverse )

Note: We have applied only one-step time series forecasting, which predicts the observation at the next time step.

Although we are getting a very good result, I am still reluctant to proceed with this model because we have limited sales data. We need to test it on more data to see how it performs.

9. Discussing issues encountered with hourly sampled data¶

In an attempt to increase the number of data points, I tried resampling the dataset at the hourly level and found a lot of zero values at certain times of day, because no orders were placed during those hours. I tried applying a SARIMA model, but it produced negative predictions and a decreasing trend. Upon further reading and consulting, I found that such data needs transformations or different approaches. Therefore, I limited myself to daily data.
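A minimal sketch of the problem, using a few hypothetical order timestamps: resampling to hourly granularity fills every quiet hour with an explicit zero, and those long runs of zeros are what destabilized the SARIMA fit.

```python
import pandas as pd

# Hypothetical orders placed during business hours only
orders = pd.DataFrame(
    {"total_amount": [120.0, 80.0, 95.0]},
    index=pd.to_datetime(["2018-01-01 10:15", "2018-01-01 14:30", "2018-01-02 11:05"]),
)

# Hourly resampling inserts a zero-valued bucket for every hour with no orders
hourly = orders.resample("H").sum()
n_zero = int((hourly["total_amount"] == 0).sum())
print(n_zero, "of", len(hourly), "hourly buckets are zero")  # 23 of 26
```

With the real dataset the proportion of zero buckets overnight is similar, so a model fitted on raw hourly totals mostly learns to predict zero.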

10. Conclusion¶

Here is a summary of all the models we tested:

Summary:¶

Model                                               RMSE       MAPE
SARIMA(1,1,1)(0,1,1)(7)                             13810.59   68.99
SARIMAX(1,1,2)(0,1,1)(7) including holidays         13312.72   65.66
Baseline Prophet                                    27437.77   71.78
Baseline Prophet with holiday                       15393.64   77.88
Tuned Prophet with holiday                          12142.69   51.45
XGBoost Regression including holiday                11493.59   52.18
Tuned XGBoost Regression including holiday          12349.61   47.58
LSTM (one-step prediction)                           2803.14    9.37
In [134]:
#plotting all the models together
fig = go.Figure()
fig.add_trace(go.Scatter(x=train_df.index, y=train_df['total_amount'].values, mode='lines', name="Train"))
fig.add_trace(go.Scatter(x=test_df.index, y=test_df['total_amount'].values, mode='lines', name="Test"))
fig.add_trace(go.Scatter(x=test_df.index, y=sarimax_predictions.values, mode='lines', name="SARIMAX"))
fig.add_trace(go.Scatter(x=test_df.index, y=forecast_tuned[-121:]['yhat'].values, mode='lines', name="FB Prophet tuned"))
fig.add_trace(go.Scatter(x=test_df.index, y=predictions_xgb_tuned, mode='lines', name="XGB Tuned"))
fig.add_trace(go.Scatter(x=test_df.index[: -2], y=prediction2_inverse, mode='lines', name="LSTM"))

fig.update_xaxes(rangeslider_visible=True)
fig.update_layout(
    yaxis_title="Revenue amount", 
    xaxis_title="Date",
    title="Daily Sales amount and forecast using different models"
)
# fig.write_html("result.html")
fig.show()
In [135]:
#plotting all the models together
fig = go.Figure()
# fig.add_trace(go.Scatter(x=train_df.index, y=train_df['total_amount'].values, mode='lines', name="Train"))
fig.add_trace(go.Scatter(x=test_df.index, y=test_df['total_amount'].values, mode='lines', name="Test"))
fig.add_trace(go.Scatter(x=test_df.index, y=sarimax_predictions.values, mode='lines', name="SARIMAX"))
fig.add_trace(go.Scatter(x=test_df.index, y=forecast_tuned[-121:]['yhat'].values, mode='lines', name="FB Prophet tuned"))
fig.add_trace(go.Scatter(x=test_df.index, y=predictions_xgb_tuned, mode='lines', name="XGB Tuned"))
fig.add_trace(go.Scatter(x=test_df.index[: -2], y=prediction2_inverse, mode='lines', name="LSTM"))

fig.update_xaxes(rangeslider_visible=True)
fig.update_layout(
    yaxis_title="Revenue amount", 
    xaxis_title="Date",
    title="Daily Sales amount and forecast using different models"
)
fig.write_html("result.html")
fig.show()

We will choose FB Prophet as the best-performing model because it achieves a competitive MAPE and is able to capture the trend, the seasonality, and the peaks within each week and month.

On the other hand, the tuned XGBoost also gave good results, but on plotting the predictions we did not see it capturing the weekly peaks or the trend. For LSTM, we may need to do further research and test it on more data points before proceeding with it.

In [132]:
#saving the tuned Prophet model to disk with joblib

import joblib

joblib.dump(fb_tuned, "forecasting_model.pkl")
Out[132]:
['forecasting_model.pkl']